Seminar SS2020 - Visual Feature Learning in
Autonomous Driving
Erçelik, Emeç
April 2020
1 Introduction
In this seminar, we will focus on the object detection and tracking problems.
Object detection localizes objects and assigns their classes on a 2D plane
(front-view or bird’s eye view) or in 3D space. This provides the distance from
the ego-vehicle to objects in the environment as well as their shapes, which is
useful for planning algorithms. Tracking is another aspect of localization, in
which the positions of specific objects are determined in subsequent frames and
a tracking ID is assigned to each of these objects. These two tasks are
therefore closely related, and both are fundamental problems for the autonomous
driving function.
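To make the idea of carrying a tracking ID across frames concrete, here is a minimal sketch in Python. It assumes each detection is reduced to a 2D centre point and associates detections with existing tracks greedily by Euclidean distance. The function name `assign_track_ids` and the parameters `max_dist` and `next_id` are hypothetical and chosen only for illustration; the methods in the referenced papers use considerably richer association cues.

```python
import math


def assign_track_ids(tracks, detections, max_dist=2.0, next_id=0):
    """Greedy nearest-centre association (illustrative assumption, not a
    method prescribed by this seminar).

    tracks:     dict {track_id: (x, y)} from the previous frame
    detections: list of (x, y) centres in the current frame
    Returns (updated_tracks, next_id).
    """
    # All candidate (distance, track_id, detection_index) triples, closest first.
    pairs = sorted(
        (math.dist(pos, det), tid, i)
        for tid, pos in tracks.items()
        for i, det in enumerate(detections)
    )
    updated, used_tracks, used_dets = {}, set(), set()
    for dist, tid, i in pairs:
        if dist > max_dist:
            break  # remaining pairs are even farther apart
        if tid in used_tracks or i in used_dets:
            continue  # each track and each detection is matched at most once
        updated[tid] = detections[i]
        used_tracks.add(tid)
        used_dets.add(i)
    # Unmatched detections start new tracks with fresh IDs.
    for i, det in enumerate(detections):
        if i not in used_dets:
            updated[next_id] = det
            next_id += 1
    return updated, next_id


# Two frames: both objects move slightly, and a third object appears.
tracks, next_id = assign_track_ids({}, [(0.0, 0.0), (10.0, 0.0)])
tracks, next_id = assign_track_ids(
    tracks, [(10.5, 0.2), (0.4, 0.1), (30.0, 5.0)], next_id=next_id
)
```

In this sketch, tracks that find no detection in the current frame are simply dropped. A practical tracker would usually keep them alive for a few frames to handle occlusions.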
You will write a literature review on one of the topics in Section 2
as a group. Groups will have a minimum of 2 and a maximum of
3 people. Therefore, only the topics that receive the most attention will
be assigned to groups. In the introduction session, I will ask you to vote for
the three topics that you or your group would most like to review.
You can send your choices via email on the same day. If you can’t form a group,
the people assigned to the same topic will form a group and work on
the same report. The report is a 6-8 page document of your findings in IEEE
format.
Under each topic, I provide a few references that I found useful for starting
the review. You don’t have to include all the provided references
in your research report. Before deciding on a topic, please read my
explanation of the topic and the abstract, introduction, and conclusion sections
of the provided references to gain a broad understanding of the topic. It is
better to start reading a paper with the abstract and introduction than to go
directly into its details. After the topic assignment, you can continue your
research with different methods. Here are some suggestions:
• Checking cited papers in the references section of the provided papers (To
find similar studies)
• Checking studies that cited the provided papers (To find more recent
studies, for example using Google Scholar)
• Checking leaderboards of the public datasets
• Checking the conferences related to the topic, e.g. ICCV, CVPR, IV, ITSC
2 Topics
• 3D Object detection using different data modalities: (max 3 students)
For the autonomous driving function in the real world, one of the key
properties is the ability to detect objects in 3D, which provides precise
locations and sizes of objects in the environment for planning the
future positions of the ego-vehicle. Because each sensor type has its own
weaknesses, it is argued that fusing data from several sensors is better,
so that the sensors compensate for each other. Fusion is also useful when a few
of the sensors fail to provide proper information. In this seminar topic,
you are asked to review the challenges brought by the weaknesses of the
sensors and the sensor fusion methods for 3D object detection using different
modalities. Additionally, you will provide your own ideas as solutions
to these challenges, supported by evidence from the literature. You can find
some of the sensor types you can include below, but you are not limited
to these.
– LiDAR
– Radar
– RGB or gray-scale camera images
– Multi-camera images
– GPS / maps
– Review on sensor fusion types
References: [5, 11, 21, 12]
• Comparison of only LiDAR-based and fusion-based 3D object
detection methods: (max 2 students) 3D object detection mostly
relies on LiDAR sensors, which yield highly precise depth information
of objects in 3D space. Even though LiDAR has its own advantages, its
data is usually sparse compared to camera images, and it does not
work as well as radar sensors across different weather conditions.
Despite these disadvantages, the state-of-the-art methods evaluated
on public datasets currently utilize only LiDAR points. In this seminar
topic, you are asked to review the literature and compare, from different
perspectives, methods using only LiDAR information for detection with
sensor fusion methods for 3D object detection. In the end, you will
also provide your own ideas about which method to utilize for better
detections, with evidence from the literature. Some of the points of view
might include:
– Advantages of each approach over the other
– Safety concerns
– Accuracy of methods
– Possible saturation in results with the future improvements in the
architectures
– Extensibility of architectures based on sensors
– Upcoming improvements in the sensors
References: [12, 11]
• Challenges in video object detection and tracking for 2D and
3D bounding boxes: (max 2 students) Detecting objects might be
challenging due to the low quality of sensor outputs at the time of sampling
(reflections etc.) and due to occluded objects (for example, parked cars that
are not fully visible behind the first few cars). These problems can be
mitigated by using time-series data such as videos or sequences of LiDAR point
clouds. In this topic, you are asked to review recent studies on object tracking
for 2D and 3D bounding boxes and present the current challenges of these
problems together with the solutions suggested in the literature. Additionally,
you are asked to propose your own solutions to these challenges considering
the experiments in the literature.
References: [1, 20, 15]
• Convolutional recurrent neural networks (RNNs) for object detection
and tracking: (max 2 students) RNNs are the most commonly
used architecture type for keeping track of important temporal features,
which improves accuracy on tasks that use time-series data. In the object
detection and tracking domains, using RNNs can get tricky because a scene
contains multiple objects. In this case, the matching problem arises, which
aims at aligning features from the same object over time. Convolutional
recurrent layers can be considered a solution to this problem, since they
process feature maps of the entire scene instead of processing the features
of each object separately over time. In this topic, you are asked to review the
usage of convolutional recurrent layers in the domain of object detection
and tracking. In the end, you will also provide your own ideas on the
usage of these layers and their usefulness depending on the data type and the
feature map type. Some of the aspects might include, but are not limited to,
the ones below.
– Advantages over linear RNNs
– Comparison on cell types (e.g. LSTMs vs. GRUs)
– Accuracies on different data types (camera images, radar data,
LiDAR data)
– Comparison on the efficacy for object detection and for tracking
References: [25, 14, 19, 15]
• Semi-supervised, self-supervised, and unsupervised learning methods
for object detection: (max 3 students) Supervised learning is
the most commonly used learning method for training object detection
networks. However, the more complex the data types are, the more difficult
and expensive the labelling effort becomes. Therefore, there has been an
increasing interest in training networks with small labeled datasets, aiming
to reach accuracies similar to those of supervised learning. In this topic,
you are asked to review such methods for object detection, especially in
3D space. Some aspects to consider while reviewing might be the
following:
– Applicability to different data types
– Comparison of results of these methods with results of methods based
on supervised learning
– Differences in loss functions used for these training schemes
References: [22, 24, 17, 9, 18, 10]
• Transfer learning for 3D object detection: (max 3 students) Deep
neural networks are known to be data hungry, which also applies
to 3D object detection, for which most studies rely on large networks.
Even though collecting data from real-world environments is relatively
easy, synchronizing and calibrating multiple sensors, labeling the collected
data for the required tasks, and ensuring diversity in object classes,
weather conditions, illumination, and road types are difficult, time-consuming,
and expensive. Therefore, there have been attempts to transfer the information
learned from available datasets to new datasets to decrease
the effort of preparing new datasets. Additionally, virtual
datasets and simulation environments have been proposed as a solution to this
problem. In this topic, you are asked to review the performance of studies that
use the available real-world and virtual datasets for transfer learning in 3D
object detection, as well as the available simulators. You will also describe
what kinds of sensors and diversity these datasets and simulators offer, in
addition to your ideas for improving such knowledge transfer, with
evidence from the literature.
References: [17, 23, 3]
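For the convolutional recurrent layers topic, the following is a minimal sketch of one convolutional GRU update in plain Python. It is meant only to illustrate how the hidden state is a feature map of the whole scene, so no per-object matching is needed. It assumes single-channel feature maps stored as nested lists and 3x3 kernels. The names `conv2d_same`, `emap`, and `conv_gru_step`, as well as the weight keys, are hypothetical, and the code is not taken from any of the referenced papers. Real implementations use a deep learning framework with multi-channel tensors.

```python
import math


def conv2d_same(x, k):
    """Zero-padded 2D cross-correlation that keeps the spatial size of x."""
    H, W, kh, kw = len(x), len(x[0]), len(k), len(k[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - kh // 2, j + b - kw // 2
                    if 0 <= ii < H and 0 <= jj < W:
                        out[i][j] += k[a][b] * x[ii][jj]
    return out


def emap(f, *maps):
    """Apply f element-wise across equally sized 2D maps."""
    return [[f(*vals) for vals in zip(*rows)] for rows in zip(*maps)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv_gru_step(x, h, w):
    """One ConvGRU update: x is the input map, h the hidden-state map,
    w a dict of kernels (keys are illustrative assumptions)."""
    # Update gate and reset gate, both computed by convolutions.
    z = emap(lambda a, b: sigmoid(a + b),
             conv2d_same(x, w["xz"]), conv2d_same(h, w["hz"]))
    r = emap(lambda a, b: sigmoid(a + b),
             conv2d_same(x, w["xr"]), conv2d_same(h, w["hr"]))
    # Candidate state uses the reset-gated hidden map.
    rh = emap(lambda a, b: a * b, r, h)
    cand = emap(lambda a, b: math.tanh(a + b),
                conv2d_same(x, w["xc"]), conv2d_same(rh, w["hc"]))
    # Blend old state and candidate per spatial location.
    return emap(lambda zi, hi, ci: (1 - zi) * hi + zi * ci, z, h, cand)


# Run the cell over a short sequence of 4x5 feature maps.
kernel = [[0.1] * 3 for _ in range(3)]
weights = {name: kernel for name in ("xz", "hz", "xr", "hr", "xc", "hc")}
hidden = [[0.0] * 5 for _ in range(4)]
for frame in ([[1.0] * 5 for _ in range(4)], [[0.5] * 5 for _ in range(4)]):
    hidden = conv_gru_step(frame, hidden, weights)
```

Because every gate is a convolution over the full map, the hidden state keeps the spatial layout of the input. This is the property that lets such layers avoid aligning per-object features over time.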
3 Other Useful Resources
• Datasets: KITTI [8], nuScenes [4]
• Simulation environment: CARLA [6], GTA V [16]
• Metrics: Detection [13], Tracking [2]; Keywords: Average Precision,
Intersection over Union, MOTA, MOTP
• Survey: [7]
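As a starting point for the metric keywords above, here is a small sketch in Python of Intersection over Union for axis-aligned 2D boxes and of the MOTA score in its commonly used form, MOTA = 1 - (FN + FP + ID switches) / (number of ground-truth objects), with all counts summed over frames. The function names and the box format `(x1, y1, x2, y2)` are assumptions made for illustration. The benchmark definitions in [13] and [2] and on the dataset leaderboards remain authoritative, in particular for rotated and 3D boxes.

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


overlap = iou_2d((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
score = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
```

A detection is typically counted as a true positive when its IoU with a ground-truth box exceeds a dataset-specific threshold. Average Precision is then computed from the resulting precision-recall curve.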
References
[1] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixe. Tracking without
bells and whistles. Proceedings of the IEEE International Conference
on Computer Vision, pages 941–951, 2019.
[2] Keni Bernardin, Alexander Elbs, and Rainer Stiefelhagen. Multiple object
tracking performance metrics and evaluation in a smart room environment.
Sixth IEEE International Workshop on Visual Surveillance, in conjunction
with ECCV, 90:91, 2006.
[3] Åsmund Brekke, Fredrik Vatsendvik, and Frank Lindseth. Multimodal 3D
object detection from simulated pretraining. Communications in Computer
and Information Science, 1056 CCIS:102–113, 2019.
[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin
Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar
Beijbom. nuScenes: A multimodal dataset for autonomous driving.
2019.
[5] Andreas Danzer, Thomas Griebel, Martin Bach, and Klaus Dietmayer.
2D Car Detection in Radar Data with PointNets. 2019 IEEE Intelligent
Transportation Systems Conference, ITSC 2019, pages 61–66, 2019.
[6] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and
Vladlen Koltun. CARLA: An Open Urban Driving Simulator. CoRL,
pages 1–16, 2017.
[7] Di Feng, Christian Haase-Schutz, Lars Rosenbaum, Heinz Hertlein,
Claudius Glaser, Fabian Timm, Werner Wiesbeck, and Klaus Dietmayer.
Deep Multi-Modal Object Detection and Semantic Segmentation for
Autonomous Driving: Datasets, Methods, and Challenges. IEEE Transactions
on Intelligent Transportation Systems, pages 1–20, 2020.
[8] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for
autonomous driving? The KITTI vision benchmark suite. Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, pages 3354–3361, 2012.
[9] Tengda Han, Weidi Xie, and Andrew Zisserman. Video Representation
Learning by Dense Predictive Coding. pages 1483–1492, 2020.
[10] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick.
Momentum Contrast for Unsupervised Visual Representation Learning. 2019.
[11] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L.
Waslander. Joint 3D Proposal Generation and Object Detection from View
Aggregation. IEEE International Conference on Intelligent Robots and
Systems, pages 5750–5757, 2018.
[12] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang,
and Oscar Beijbom. PointPillars: Fast encoders for object detection from
point clouds. Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pages 12689–12697, 2019.
[13] Lei Liu, Zongxu Pan, and Bin Lei. Learning a Rotation Invariant Detector
with Rotatable Bounding Box. 2017.
[14] Yongyi Lu, Cewu Lu, and Chi Keung Tang. Online Video Object
Detection Using Association LSTM. Proceedings of the IEEE International
Conference on Computer Vision, pages 2363–2371, 2017.
[15] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and Furious: Real Time
End-to-End 3D Detection, Tracking and Motion Forecasting with a Single
Convolutional Net. Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
[16] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun.
Playing for data: Ground truth from computer games. Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics), 9906 LNCS:102–118, 2016.
[17] Yew Siang Tang and Gim Hee Lee. Transferable semi-supervised 3D
object detection from RGB-D data. Proceedings of the IEEE International
Conference on Computer Vision, pages 1931–1940, 2019.
[18] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive Multiview
Coding. June 2019.
[19] Subarna Tripathi, Zachary C. Lipton, Serge Belongie, and Truong Nguyen.
Context matters: Refining object detection in video with recurrent neural
networks. British Machine Vision Conference 2016, BMVC 2016,
pages 1–12, 2016.
[20] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip H.S.
Torr. Fast online object tracking and segmentation: A unifying approach.
Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, pages 1328–1338, 2019.
[21] Bin Yang, Ming Liang, and Raquel Urtasun. HDNET: Exploiting HD Maps for
3D Object Detection. CoRL, pages 1–10, 2018.
[22] Zhaohui Yang, Miaojing Shi, Yannis Avrithis, Chao Xu, and Vittorio
Ferrari. Training Object Detectors from Few Weakly-Labeled and Many
Unlabeled Images. 2019.
[23] Fuxun Yu, Di Wang, Yinpeng Chen, Nikolaos Karianakis, Pei Yu, Dimitrios
Lymberopoulos, and Xiang Chen. Unsupervised Domain Adaptation for
Object Detection via Cross-Domain Semi-Supervised Learning. 2019.
[24] Na Zhao, Tat-Seng Chua, and Gim Hee Lee. SESS: Self-Ensembling
Semi-Supervised 3D Object Detection. 2019.
[25] Menglong Zhu and Mason Liu. Mobile Video Object Detection with
Temporally-Aware Feature Maps. Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, pages 5686–5695,
2018.