Yashar Deldjoo, Tommaso Di Noia, Eugenio Di Sciascio, Gaetano Pernisco, Vito Renò, Ettore Stella, "Towards real-time monocular depth estimation for mobile systems," Proc. SPIE 11785, Multimodal Sensing and Artificial Intelligence: Technologies and Applications II, 117850J (20 June 2021); doi: 10.1117/12.2596031

Event: SPIE Optical Metrology, 2021, Online Only

Towards Real-Time Monocular Depth Estimation for Mobile Systems

Yashar Deldjoo (a), Tommaso Di Noia (a), Eugenio Di Sciascio (a), Gaetano Pernisco (a,*), Vito Renò (b), and Ettore Stella (b)

(a) Polytechnic University of Bari, Via Amendola 126/b, Bari, Italy
(b) National Research Council, Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing, Via Giovanni Amendola 122, Bari, Italy

                                                                                 ABSTRACT
Nowadays, thanks to the development of new advanced driver-assistance systems (ADAS) able to help drivers in their driving tasks, autonomous driving is becoming part of our lives. This massive development is mainly due to the higher safety levels that these systems can guarantee to the vehicles that travel on our roads every day. At the heart of every application in this area lies the perception of the environment, which guides the vehicle's behavior. This holds for the autonomous driving field and for all the applications characterized by a system that moves in the 3D real world, such as robotics, augmented reality, etc. For this purpose, an effective 3D perception system is necessary to accurately localize all the objects that compose the scene and reconstruct it in a 3D model. This issue is often addressed using LIDAR sensors, which allow an accurate 3D perception and offer high robustness in unfavorable light and weather conditions. However, these sensors are generally expensive and thus do not represent the right choice for low-cost vehicles and robots. Moreover, they need to be mounted in a particular position, so integrating them on a car changes both its appearance and its aerodynamics. Besides, their output is point cloud data that, due to its structure, is not easily manageable with the deep learning models that promise outstanding results in various similar predictive tasks. For these reasons, in some applications it is better to leverage other sensors, such as RGB cameras, to estimate 3D perception. To this end, more classic approaches are based on stereo cameras, RGB-D cameras, and stereo from motion, which generally reconstruct the scene with less accuracy than LIDARs but still produce acceptable results.

In recent years, several approaches have been proposed in the literature that aim to estimate depth from a monocular camera leveraging deep learning models. Some of these methods use a supervised approach [1, 2]; however, they mainly rely on annotated datasets which in practice can be labor-expensive to collect. Thus, other works [3, 4] use, on the contrary, a self-supervised training procedure leveraging the reprojection error. Notwithstanding their good performance, most of the proposed approaches use very deep neural networks that are power- and resource-consuming and need high-end GPUs to produce results in real time. For these reasons, these approaches cannot be used in systems with power and computational limits.

In this work, we propose a new approach based on a standard CNN proposed in the literature to deal with the image segmentation problem, designed not to be highly resource-dependent. For the training, we used the knowledge distillation method, with an off-the-shelf pre-trained network as teacher network. We execute large-scale experiments to qualitatively and quantitatively compare our results with those obtained with baselines. Moreover, we propose an in-depth study of inference times using both general-purpose and mobile architectures.
                Keywords: Depth prediction, Autonomous Driving, Monocular depth, Knowledge distillation
                    *Authors are listed in alphabetical order.
                    Corresponding author:
                    Gaetano Pernisco: E-mail: gaetano.pernisco@poliba.it

                                                 Multimodal Sensing and Artificial Intelligence: Technologies and Applications II
                                                   edited by Ettore Stella, Proc. of SPIE Vol. 11785, 117850J · © 2021 SPIE
                                                           CCC code: 0277-786X/21/$21 · doi: 10.1117/12.2596031

1. INTRODUCTION
Depth estimation is a long-studied problem in computer vision. Different sensors can provide depth information relying on different technologies, e.g. LIDARs take advantage of laser technology while depth cameras usually rely on structured light. Other methods leverage geometric constraints between multi-view images; in particular, most approaches rely on stereo vision and stereo-from-motion. These approaches assume that multiple views of the scene are available, which is not always possible in real applications. Depth estimation is a crucial task for several applications in the autonomous driving, robotics and augmented reality fields. In fact, it permits an accurate three-dimensional reconstruction of the environment and is consequently at the base of many complex applications.

Recently, a large number of works cast monocular depth estimation as a supervised learning task leveraging Convolutional Neural Networks. These approaches, even if able to reach impressive results, need a huge number of images with their respective ground-truth depth maps for training. Such datasets are very difficult to create and require a great deal of effort.

Estimating the depth of a scene from a single image is an ill-posed problem, since it cannot rely on epipolar geometric constraints. Humans perform this task well by unconsciously making assumptions about object dimensions [5]. Some recent works address this problem in a self-supervised way, exploiting photometric reprojection constraints. These approaches, in contrast with the supervised methods, are trained on multi-view images of the same scene and do not need ground-truth depth data. Furthermore, such datasets are simpler to acquire and require less effort.

Most of the approaches proposed in the literature rely on very deep CNNs that reach surprising results. However, these methods are power- and resource-consuming and need high-end GPUs to produce results in real time. For this reason, it is not possible to rely on them in applications where such power and resources are not available, e.g. augmented reality or mobile robotics.

In this work we propose a new self-supervised approach based on knowledge distillation. We aim to propose a pipeline able to exploit the knowledge stored in a deep pre-trained CNN and transfer it to a smaller one. Our contributions are mainly three:

  i We propose an approach able to distill the knowledge of an off-the-shelf CNN, Monodepth2 [6], and transfer it into another network (DeepLabv3+ [7]) to tackle the monocular depth estimation task.

 ii We propose to use the DeepLabv3+ architecture to deal with a task different from segmentation, for which it was created.

iii Finally, we executed large-scale experiments to test the accuracy of different versions of our model on KITTI [8] and measured their inference time on different hardware architectures to also compare their performance.

                                                                          2. RELATED WORK
Depth estimation from images is a long-studied task in the computer vision community. Most works focused either on the use of stereo images [9], multiple images acquired from different viewpoints [10] or at different times [11], or on making some assumptions about the scene, e.g. a static scene observed from a fixed viewpoint under different lighting [12, 13]. These methods find space in many applications but rely on multiple images of the same scene. In this section we present an overview of works focused on monocular depth estimation, i.e. methods leveraging a single input image, with a focus on methods casting this as a learning task.

                2.1 Supervised Monocular Depth Estimation
Monocular depth estimation is an ill-posed problem in which only one image is available at inference time; therefore it is not possible to rely on geometric constraints. For this reason, it has been a deeply studied task in recent years.

Saxena et al. [14] proposed a model called Make3D which leverages over-segmentation of the input image to divide it into patches and then estimates the 3D location and orientation of the local planes that compose the scene. The planes' parameters are predicted using a linear model and then combined by means of an MRF model. This method, however, tends to fail with thin structures and, furthermore, does not consider the global structure of the scene. Liu et al. [2] face monocular depth estimation by training a CNN, while Ladicky et al. [15] leverage semantic information to improve the results. Karsch et al. [16] rely on a database of images with the corresponding depth to obtain more consistent image-level depth estimates. They cast the problem as an image matching problem and then use temporal information to refine the result.

Eigen et al. [1] proposed a method based on a multi-scale CNN trained in a supervised manner on images and the corresponding depth maps. Differently from most of the works described so far, this approach infers depth directly from the raw pixel values without relying on any segmentation or hand-crafted features. Relying on Deep3D [17], Luo et al. [18] modeled the monocular depth estimation task as a stereo matching problem, synthesizing the right view. Kumar et al. [19] leveraged the ability of recurrent neural networks (RNNs) to learn spatio-temporal information to predict monocular depth from video sequences.

Atapour-Abarghouei et al. [20] tackled the problem from a real-application point of view. They formulated monocular depth estimation as an image-to-image translation problem and, leveraging adversarial training on synthetic data, proposed a network able to generalize enough to be domain independent. Ummenhofer et al. [21] proposed DeMoN, a model based on a chain of autoencoders that estimates both depth and ego-motion from a sequence of monocular images. Lastly, Chen et al. [22] trained a model able to jointly estimate depth and semantic segmentation, while ViP-DeepLab [23] estimates depth jointly with panoptic segmentation.

All these methods need a large number of images with their respective ground truth in the training set to learn to estimate depth maps from a monocular image. The creation of such a dataset is not trivial; therefore this and other works prefer to explore other possibilities.
                2.2 Self-supervised Monocular Depth Estimation
DeepStereo, proposed by Flynn et al. [24], is trained in an unsupervised way on images acquired from different points of view to synthesize new views. This model relies on depth estimation to sample colors from the neighboring images. Nevertheless, since this architecture needs multiple images also at inference time, it is not suitable for monocular depth estimation. Deep3D [17] also tackles the novel view synthesis task in a binocular setup. This work aims to generate the right view given the left one of the same scene in the context of 3D movies. The method leverages an image reconstruction loss to produce a distribution over all possible disparities for each pixel. Zhan et al. [25] proposed an autoencoder framework for monocular depth estimation in a stereo setup leveraging the reconstruction loss. Their image synthesis is not fully differentiable, so they linearized their loss using a Taylor approximation, which makes training more difficult.

Godard et al. proposed Monodepth [4], which overcomes this problem using bilinear sampling. They proposed a framework that, at training time, estimates depth from both the left and the right image of a stereo pair in order to use left-right consistency as a training constraint. Inspired by this work, Aleotti et al. [3] used an adversarial framework to tackle monocular depth estimation in a stereo setup, while Poggi et al. [26] developed 3Net, a thin architecture that faces the same task on CPU leveraging a trinocular assumption.

Zhou et al. [27] proposed a framework trained on monocular videos to synthesize depth maps without the stereo constraint. At training time, it estimates depth by minimizing the reconstruction loss between temporally subsequent frames, using a pose network to estimate the relative pose between them. With a similar aim, Mahjourian et al. [28] introduced an innovative ICP-based loss to jointly estimate depth and ego-motion from unconstrained monocular videos, being the first depth-from-video algorithm to use 3D information in a loss function.

Later, Godard et al. [6] extended their previous work, proposing an architecture suitable for both stereo-pair and monocular-video training. In this work they focused on artifacts mainly due to occlusions and moving objects.

                                                                      3. PROPOSED METHOD
In this section, we briefly introduce the knowledge distillation [29] concept, in particular the vanilla knowledge distillation used in this work. Then we review Monodepth2 [6], which we use as teacher model, and the DeepLabv3+ [7] architecture, which we chose as student model to enable monocular depth estimation on mobile systems.

[Figure 1: diagram of the proposed pipeline, with a pre-trained Teacher Net and a Student Net to be trained; the L1 loss between their outputs is backpropagated into the student.]

Figure 1: Proposed knowledge distillation method for monocular depth estimation. The method uses the output of a pre-trained network as training signal. In particular, we used a pre-trained Monodepth2 [6] model as teacher network. We trained a student network (DeepLabv3+) from scratch, relying on a simple L1 loss between the output of the teacher network and that of the student network.

                3.1 Knowledge Distillation
Knowledge distillation is one of the most widespread forms of model compression and acceleration. It is based on the idea of training a small student model from a large teacher model. This technique is particularly helpful in overcoming the problem of using deep models on devices with limited resources, e.g. mobile phones or embedded systems.

Gou et al. [29] classify knowledge distillation algorithms from the perspective of knowledge categories, training schemes, teacher-student architectures and distillation algorithms. An exhaustive explanation of all the possible knowledge distillation schemes is outside the scope of this paper.
In our work we used an offline distillation scheme, in which the knowledge is transferred from a pre-trained teacher model to a student model. In this approach the training process is divided into two stages:

   • Teacher model training: the teacher model is trained on a set of training data before distillation;
   • Student model training: the student model is trained from scratch using as supervision signal the knowledge extracted from the teacher model. The knowledge can be in the form of intermediate features or of the output of the model.

As supervision signal for the student model training, we used only the output image of the teacher model, comparing it with the output of the student model. More precisely, during training we minimize an L1 loss between the teacher and the student outputs.
Let F_t be the teacher model, F_s the student model and i the input image; the L1 loss is defined as:

    L_1 = \frac{1}{N} \sum_{p=1}^{N} \left| F_t(i)_p - F_s(i)_p \right|                         (1)

where p is the pixel index, N is the total number of pixels in the image, and F_t(i)_p and F_s(i)_p represent the p-th pixel of the output image of the teacher and the student model, respectively.
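As an illustration of this training step, the following is a minimal PyTorch sketch, assuming a frozen pre-trained teacher and a student that both map a batch of RGB images to a single-channel disparity map; the function and variable names are ours and are not taken from any released code.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, images, optimizer):
    """One offline-distillation step: the frozen teacher produces the target
    map and the student is updated with the pixel-wise L1 loss of Eq. (1)."""
    teacher.eval()
    with torch.no_grad():                 # the teacher is pre-trained and never updated
        target = teacher(images)          # assumed shape (B, 1, H, W)

    prediction = student(images)          # student output with the same shape
    loss = F.l1_loss(prediction, target)  # mean absolute error over all pixels

    optimizer.zero_grad()
    loss.backward()                       # gradients flow only into the student
    optimizer.step()
    return loss.item()
```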

3.2 The Teacher Network
As stated at the beginning of this section, in our knowledge distillation pipeline we use an offline approach leveraging a pre-trained Monodepth2 [6] model as teacher.

This model was presented by Godard et al. [6] and extends their previous work [4]. It uses a Convolutional Neural Network in a self-supervised setting to tackle monocular depth estimation in both a stereo and a monocular setup. The approach casts the learning problem as a novel view-synthesis problem, in which the network learns to reconstruct the appearance of a target image depicting the same scene as the input image from another point of view. In other words, monocular depth estimation is treated as a photometric reprojection error minimization problem. Let T_{t \to t'} be the relative pose of each source image I_{t'} with respect to the pose of the target image I_t; the model aims to predict the depth map D_t that minimizes the reprojection error L_p defined as:

    L_p = \min_{t'} \; pe(I_t, I_{t \to t'})                            (2)

where
    I_{t \to t'} = I_{t'} \left\langle proj(D_t, T_{t \to t'}, K) \right\rangle                       (3)

where pe is the photometric reconstruction error, proj() is the projection of the depth map D_t into the point of view of I_{t'}, the \langle \cdot \rangle operator indicates the bilinear sampling operation [30], and K is the camera intrinsics matrix. For the photometric reconstruction error they use L1 and SSIM [31]; more precisely, pe is defined as:

    pe(I_a, I_b) = \frac{\alpha}{2} \left(1 - SSIM(I_a, I_b)\right) + (1 - \alpha) \lVert I_a - I_b \rVert_1             (4)
with α = 0.85. They also use a smoothness loss L_s defined as:

    L_s = \lvert \partial_x d^*_t \rvert \, e^{-\lvert \partial_x I_t \rvert} + \lvert \partial_y d^*_t \rvert \, e^{-\lvert \partial_y I_t \rvert}               (5)

where d^*_t = d_t / \bar{d}_t is the mean-normalized inverse depth [32]. Finally, the overall objective function is:

    L = \mu L_p + \lambda L_s                                   (6)

For stereo training, I_{t'} and I_t are the two images of the stereo pair, whose relative pose is known. For the monocular setup, the two frames temporally adjacent to I_t are used as source images. In this case the relative poses are not known, so a pose estimation network predicts the relative pose between frames, which is then used in the projection function. The pose and depth estimation networks are trained simultaneously. For the mixed training setup, I_{t'} includes both the temporally adjacent frames and the second view of the stereo pair.
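For reference, a simplified PyTorch sketch of the photometric error of Eq. (4) and the edge-aware smoothness term of Eq. (5) is given below. It assumes images of shape (B, 3, H, W) and disparities of shape (B, 1, H, W), and uses a plain 3x3 average-pooling SSIM; it is an illustrative reading of these formulas, not the Monodepth2 implementation.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM over 3x3 windows, values in [0, 1]."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def photometric_error(i_a, i_b, alpha=0.85):
    """pe(I_a, I_b) of Eq. (4): weighted SSIM + L1 term, per pixel."""
    l1 = (i_a - i_b).abs().mean(1, keepdim=True)
    dssim = (1 - ssim(i_a, i_b)).mean(1, keepdim=True)
    return alpha / 2 * dssim + (1 - alpha) * l1

def smoothness_loss(disp, image):
    """Edge-aware smoothness of Eq. (5) on the mean-normalized disparity."""
    d = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx_d = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dy_d = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    dx_i = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```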

                3.3 The Student Network
As said before, the aim of this work is to propose a model that can tackle monocular depth estimation with limited computational resources. To this end, we chose the DeepLabv3+ [7] architecture as student network, which to the best of our knowledge is the state of the art among network architectures developed for mobile applications. This network was proposed for semantic segmentation, but since it is based on an encoder-decoder structure it can be adapted to our goal. DeepLabv3+ is an extension of DeepLabv3 [33] in which a simple decoder is added to refine the results, especially on object boundaries.

The main idea behind DeepLabv3 is the Atrous Spatial Pyramid Pooling (ASPP) module, composed of several parallel atrous convolutions with different rates. Atrous convolutions are a generalization of the convolution that allows adjusting the field of view of the filters to capture multi-scale information. Let x and y be the input and output feature maps respectively, and w the convolutional kernel; the atrous convolution at every position i is computed as:

    y[i] = \sum_{k} x[i + r \cdot k] \, w[k]                                           (7)

where r is the rate (dilation) used to sample the input feature map. DeepLabv3 applies ASPP after a deep convolutional backbone to extract features at multiple scales by applying several atrous convolutions with different rates. In DeepLabv3+ the last feature map before the logits of DeepLabv3 is used as encoder output; this feature map contains 256 channels. The decoder is very simple but was designed to recover object details. The encoder output is first upsampled by a factor of 4 and then concatenated with the corresponding low-level features from the backbone network at the same resolution. Before the concatenation, a point-wise convolution is applied to the low-level backbone features to reduce their number of channels. After the concatenation, a few 3x3 convolutions are applied, followed by a final 4x bilinear upsampling.

This architecture allows leveraging the advantages of both ASPP and the encoder-decoder design for image-to-image translation with limited resources.
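As a concrete illustration of how a segmentation architecture can be re-purposed in this way, the sketch below builds a dilated (atrous) 3x3 convolution as in Eq. (7) and wraps torchvision's DeepLabV3 (which lacks the "+" decoder but shares the ASPP design) with a single-channel sigmoid output for disparity; the class and its details are our own example, not the exact student network used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

# A 3x3 atrous convolution with rate r = 6: the kernel samples the input with a
# step of 6 pixels (Eq. 7), enlarging the receptive field at constant cost.
atrous_conv = nn.Conv2d(256, 256, kernel_size=3, padding=6, dilation=6)

class DepthDeepLab(nn.Module):
    """DeepLabv3-style network adapted to depth: the multi-class segmentation
    head is replaced by a single-channel output squashed with a sigmoid."""
    def __init__(self):
        super().__init__()
        self.net = deeplabv3_resnet50(num_classes=1)  # 1 output channel instead of classes

    def forward(self, x):
        logits = self.net(x)["out"]   # (B, 1, H, W), upsampled to the input resolution
        return torch.sigmoid(logits)  # normalized disparity in (0, 1)

model = DepthDeepLab()
disp = model(torch.randn(1, 3, 320, 1024))  # same 1024x320 resolution used in Sec. 4.2
```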

                                                                            4. EXPERIMENTS
Here we compare the performance and the execution time of our approach with the baseline Monodepth2 [6]. We train our models using only the teacher supervision, without any ground truth. We train and test our models on the KITTI 2015 [8] dataset. For the evaluation we compare different versions of our student model with the Monodepth2 [6] model, focusing on inference times.

                4.1 KITTI Dataset
As in the work of Godard et al. [4], we present our results on the KITTI dataset [8] using two different splits. This dataset contains 61 scenes with a total of 42,382 rectified stereo pairs of 1242x375 pixels. The two splits used are the Eigen split [1] and the Eigen Zhou split [27].

The Eigen split [1] uses a test set of 697 images from 29 different scenes. The remaining 32 scenes are split into 22,600 stereo images for training and 888 for evaluation. To generate the ground truth for this split, we reproject the LIDAR points onto the left RGB camera.

The Eigen Zhou split [27] is suited for monocular training and removes the static frames from the dataset. This split contains a total of 39,810 monocular images for training and 4,424 for validation.

In all our experiments we use the same intrinsics for all images, setting the camera principal point to the image center and the focal length to the average of all the focal lengths in the KITTI dataset. Furthermore, we cap the maximum depth at 80 m, as is standard practice. In all the tests, we evaluated our models on the Eigen split.
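As a small illustration of this preprocessing, a sketch of the shared intrinsics matrix and of the depth cap is shown below; the focal-length value passed in the example is a placeholder, since the actual KITTI-wide average is computed from the calibration files.

```python
import numpy as np

def shared_intrinsics(width, height, focal_length):
    """Single K matrix used for every image: principal point at the image
    center, focal length fixed to the dataset-wide average."""
    return np.array([[focal_length, 0.0, width / 2.0],
                     [0.0, focal_length, height / 2.0],
                     [0.0, 0.0, 1.0]])

def cap_depth(depth, max_depth=80.0):
    """Clamp depth values to 80 m, the standard practice on KITTI."""
    return np.clip(depth, 0.0, max_depth)

K = shared_intrinsics(1242, 375, focal_length=720.0)  # placeholder focal length
```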

                4.2 Results
In all our experiments we used as teacher network the pre-trained Monodepth2 [6] model trained with the mixed monocular and stereo approach (MS). We chose it because it reaches the best results on KITTI. We used an image resolution of 1024x320, obtained by resizing the input images. Results are computed using the metrics proposed by Eigen et al. [1]. This protocol relies on several metrics, namely absolute relative difference (Abs Rel), squared relative difference (Sq Rel), RMSE and RMSE (log), which measure the difference in meters with respect to the ground-truth depth, plus metrics based on the percentage of predicted depths that fall within a given threshold of the ground-truth value. The latter are useful because the non-thresholded metrics can be sensitive to large errors caused by predictions at small disparity values.
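For clarity, a compact NumPy sketch of these metrics (following the usual KITTI evaluation protocol of Eigen et al. [1]) could be written as follows; gt and pred are assumed to be 1-D arrays of valid ground-truth and predicted depths in meters, already capped at 80 m.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Abs Rel, Sq Rel, RMSE, RMSE log and the delta < 1.25^k accuracies."""
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    ratio = np.maximum(gt / pred, pred / gt)        # symmetric ratio per pixel
    a1, a2, a3 = (np.mean(ratio < 1.25 ** k) for k in (1, 2, 3))
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```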
Table 1 reports all the results obtained with the presented setup in comparison with our baseline, which is represented by our teacher network. The quantitative results show that, using our knowledge distillation approach with DeepLabv3+ [7] and a ResNet101 [34] backbone trained on the Eigen split, we reach results very similar to our baseline and surpass it on some metrics. Furthermore, as expected, using MobileNet [35] as backbone we obtain worse results than with the other architectures (because this network has fewer parameters), but they still remain quite good.
Table 2 reports the results on inference times. In particular, for every network architecture presented above, it shows the number of trainable parameters and the inference time on different hardware architectures. DeepLabv3+ with MobileNet as backbone is the network with the lowest number of parameters, about 6 times fewer than the Monodepth2 [6] architecture (our baseline).

    Architecture                  Split    Abs Rel   Sq Rel   RMSE    RMSE log   δ < 1.25   δ < 1.25²   δ < 1.25³
    DeepLabv3+ with MobileNet     Zhou     0.108     0.746    4.657   0.193      0.865      0.957       0.981
    DeepLabv3+ with ResNet50      Zhou     0.121     0.975    5.051   0.216      0.841      0.937       0.967
    DeepLabv3+ with ResNet101     Zhou     0.109     0.869    4.649   0.250      0.873      0.958       0.980
    DeepLabv3+ with MobileNet     Eigen    0.110     0.871    5.139   0.333      0.863      0.955       0.980
    DeepLabv3+ with ResNet50      Eigen    0.107     0.771    4.666   0.199      0.865      0.953       0.978
    DeepLabv3+ with ResNet101     Eigen    0.104     0.752    4.552   0.193      0.875      0.958       0.980
    Monodepth2 (baseline)         Eigen    0.105     0.790    4.608   0.193      0.876      0.958       0.680

Table 1: Comparison of different models. Results on the KITTI 2015 stereo dataset with two different splits: Eigen [1] and Eigen Zhou [27]. For the first four metrics lower is better; for the last three higher is better.

For our experiments on general-purpose hardware we used a PC with an Intel i7-5720K CPU, which has 6 cores at 3.30 GHz, and an Nvidia GTX 970 GPU. For the experiments on mobile, instead, we used a Xiaomi Poco X3 NFC, an Android 10 phone with a Qualcomm SM7150-AC Snapdragon 732G octa-core SoC. On mobile we implemented a simple application using PyTorch Mobile [36] and ran the model on the CPU with multi-threading. The reported results are expressed in seconds and are the average over 10 different executions. As can be seen in the table, DeepLabv3+ with MobileNet is the fastest in the single- and multi-CPU setups. On the contrary, in the GPU and mobile setups the fastest architecture is Monodepth2 [6].
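To make this benchmarking protocol concrete, a possible sketch of the timing loop and of the export for PyTorch Mobile is shown below; it uses standard PyTorch utilities (torch.jit.trace, optimize_for_mobile, the lite-interpreter save), but the script itself is our own illustration rather than the exact code used for Table 2.

```python
import time
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

def average_inference_time(model, device, runs=10, size=(1, 3, 320, 1024)):
    """Average forward-pass time in seconds over `runs` executions."""
    model = model.to(device).eval()
    x = torch.randn(*size, device=device)
    with torch.no_grad():
        model(x)                          # warm-up pass, not timed
        if device.type == "cuda":
            torch.cuda.synchronize()      # make sure pending GPU work is finished
        start = time.time()
        for _ in range(runs):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.time() - start) / runs

def export_for_mobile(model, path="student.ptl", size=(1, 3, 320, 1024)):
    """Trace the model and save a PyTorch Mobile (lite interpreter) artifact."""
    traced = torch.jit.trace(model.eval(), torch.randn(*size))
    optimize_for_mobile(traced)._save_for_lite_interpreter(path)
```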

    Architecture                  # of trainable parameters   Single CPU [s]   Multi CPU [s]   GPU [s]   Mobile [s]
    DeepLabv3+ with MobileNet      2,643,281                  0.4231           0.1445          0.0161    1.429
    DeepLabv3+ with ResNet50      26,609,233                  1.3956           0.3881          0.0369    4.65
    DeepLabv3+ with ResNet101     45,601,361                  1.9663           0.5139          0.06051   9.425
    Monodepth2 (baseline)         14,842,236                  0.7365           0.2611          0.0123    1.099

Table 2: Inference time. The table shows the number of trainable parameters and the time (in seconds) needed to run inference on a single image with each architecture. The time is computed as the average of 10 consecutive executions.

                                                                             5. CONCLUSION
In this work we proposed a technique to tackle the monocular depth estimation problem with a self-supervised approach. We exploit an off-the-shelf pre-trained model as supervision signal in a very basic knowledge distillation pipeline. We adapted the DeepLabv3+ architecture to fit our task and trained different versions of this model on different splits of the KITTI dataset. We compared all the analyzed models in terms of accuracy and inference times. We concluded that it is possible to tackle the monocular depth estimation problem using
the DeepLabv3+ architecture, reaching results comparable with our baseline. Unexpectedly, even though our model contains fewer parameters, it turns out to be slower in the mobile setup than Monodepth2 [6], because the latter is easier to optimize for mobile.

In the future it will be useful to study a more complex knowledge distillation scheme between these two networks, leveraging connections between intermediate layers and introducing other loss functions typical of this domain, based on image appearance and geometric constraints. Furthermore, we would like to design a new custom model for the monocular depth estimation task able to run in real time on mobile hardware, and to test it on different indoor and outdoor datasets.

                                                                               REFERENCES
                 [1] Eigen, D., Puhrsch, C., and Fergus, R., “Depth map prediction from a single image using a multi-scale deep
                     network,” arXiv preprint arXiv:1406.2283 (2014).
                 [2] Liu, F., Shen, C., Lin, G., and Reid, I., “Learning depth from single monocular images using deep convo-
                     lutional neural fields,” IEEE transactions on pattern analysis and machine intelligence 38(10), 2024–2039
                     (2015).
                 [3] Aleotti, F., Tosi, F., Poggi, M., and Mattoccia, S., “Generative adversarial networks for unsupervised
                     monocular depth prediction,” in [Proceedings of the European Conference on Computer Vision (ECCV)
                     Workshops], 0–0 (2018).
                 [4] Godard, C., Mac Aodha, O., and Brostow, G. J., “Unsupervised monocular depth estimation with left-
                     right consistency,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition],
                     270–279 (2017).
                 [5] Howard, I. P., “Perceiving in depth, vol. 1: Basic mechanisms.,” (2012).
                 [6] Godard, C., Mac Aodha, O., Firman, M., and Brostow, G. J., “Digging into self-supervised monocular depth
                     estimation,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision], 3828–3838
                     (2019).
                 [7] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H., “Encoder-decoder with atrous separable
                     convolution for semantic image segmentation,” in [Proceedings of the European conference on computer
                     vision (ECCV)], 801–818 (2018).
                 [8] Menze, M., Heipke, C., and Geiger, A., “Joint 3d estimation of vehicles and scene flow,” in [ISPRS Workshop
                     on Image Sequence Analysis (ISA) ], (2015).
                 [9] Scharstein, D. and Szeliski, R., “A taxonomy and evaluation of dense two-frame stereo correspondence
                     algorithms,” International journal of computer vision 47(1), 7–42 (2002).
                [10] Furukawa, Y. and Hernández, C., “Multi-view stereo: A tutorial,” Foundations and Trends® in Computer
                     Graphics and Vision 9(1-2), 1–148 (2015).
                [11] Ranftl, R., Vineet, V., Chen, Q., and Koltun, V., “Dense monocular depth estimation in complex dynamic
                     scenes,” in [Proceedings of the IEEE conference on computer vision and pattern recognition], 4058–4066
                     (2016).
                [12] Woodham, R. J., “Photometric method for determining surface orientation from multiple images,” Optical
                     engineering 19(1), 191139 (1980).
                [13] Abrams, A., Hawley, C., and Pless, R., “Heliometric stereo: Shape from sun position,” in [European con-
                     ference on computer vision], 357–370, Springer (2012).
                [14] Saxena, A., Sun, M., and Ng, A. Y., “Make3d: Learning 3d scene structure from a single still image,” IEEE
                     transactions on pattern analysis and machine intelligence 31(5), 824–840 (2008).
                [15] Ladicky, L., Shi, J., and Pollefeys, M., “Pulling things out of perspective,” in [Proceedings of the IEEE
                     conference on computer vision and pattern recognition], 89–96 (2014).
                [16] Karsch, K., Liu, C., and Kang, S. B., “Depth transfer: Depth extraction from video using non-parametric
                     sampling,” IEEE transactions on pattern analysis and machine intelligence 36(11), 2144–2158 (2014).
                [17] Xie, J., Girshick, R., and Farhadi, A., “Deep3d: Fully automatic 2d-to-3d video conversion with deep
                     convolutional neural networks,” in [European Conference on Computer Vision ], 842–857, Springer (2016).

[18] Luo, W., Schwing, A. G., and Urtasun, R., “Efficient deep learning for stereo matching,” in [Proceedings of
                     the IEEE conference on computer vision and pattern recognition ], 5695–5703 (2016).
                [19] CS Kumar, A., Bhandarkar, S. M., and Prasad, M., “Depthnet: A recurrent neural network architecture
                     for monocular depth prediction,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern
                     Recognition Workshops ], 283–291 (2018).
                [20] Atapour-Abarghouei, A. and Breckon, T. P., “Real-time monocular depth estimation using synthetic data
                     with domain adaptation via image style transfer,” in [Proceedings of the IEEE Conference on Computer
                     Vision and Pattern Recognition ], 2800–2810 (2018).
                [21] Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., and Brox, T., “Demon: Depth
                     and motion network for learning monocular stereo,” in [Proceedings of the IEEE Conference on Computer
                     Vision and Pattern Recognition ], 5038–5047 (2017).
                [22] Chen, Y., Li, W., Chen, X., and Gool, L. V., “Learning semantic segmentation from synthetic data: A
                     geometrically guided input-output adaptation approach,” in [Proceedings of the IEEE/CVF Conference on
                     Computer Vision and Pattern Recognition], 1841–1850 (2019).
                [23] Qiao, S., Zhu, Y., Adam, H., Yuille, A., and Chen, L.-C., “Vip-deeplab: Learning visual perception with
                     depth-aware video panoptic segmentation,” arXiv preprint arXiv:2012.05258 (2020).
                [24] Flynn, J., Neulander, I., Philbin, J., and Snavely, N., “Deepstereo: Learning to predict new views from
                     the world’s imagery,” in [Proceedings of the IEEE conference on computer vision and pattern recognition],
                     5515–5524 (2016).
                [25] Zhan, H., Garg, R., Weerasekera, C. S., Li, K., Agarwal, H., and Reid, I., “Unsupervised learning of
                     monocular depth estimation and visual odometry with deep feature reconstruction,” in [Proceedings of the
                     IEEE Conference on Computer Vision and Pattern Recognition], 340–349 (2018).
                [26] Poggi, M., Tosi, F., and Mattoccia, S., “Learning monocular depth estimation with unsupervised trinocular
                     assumptions,” in [2018 International conference on 3d vision (3DV)], 324–333, IEEE (2018).
                [27] Zhou, T., Brown, M., Snavely, N., and Lowe, D. G., “Unsupervised learning of depth and ego-motion
                     from video,” in [Proceedings of the IEEE conference on computer vision and pattern recognition ], 1851–1858
                     (2017).
                [28] Mahjourian, R., Wicke, M., and Angelova, A., “Unsupervised learning of depth and ego-motion from monoc-
                     ular video using 3d geometric constraints,” in [Proceedings of the IEEE Conference on Computer Vision
                     and Pattern Recognition ], 5667–5675 (2018).
                [29] Gou, J., Yu, B., Maybank, S. J., and Tao, D., “Knowledge distillation: A survey,” International Journal of
                     Computer Vision 129(6), 1789–1819 (2021).
                [30] Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K., “Spatial transformer networks,” arXiv
                     preprint arXiv:1506.02025 (2015).
                [31] Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P., “Image quality assessment: from error visibility
                     to structural similarity,” IEEE transactions on image processing 13(4), 600–612 (2004).
                [32] Wang, C., Buenaposada, J. M., Zhu, R., and Lucey, S., “Learning depth from monocular videos using
                     direct methods,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition],
                     2022–2030 (2018).
                [33] Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H., “Rethinking atrous convolution for semantic image
                     segmentation,” arXiv preprint arXiv:1706.05587 (2017).
                [34] He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [Proceedings of
                     the IEEE conference on computer vision and pattern recognition ], 770–778 (2016).
                [35] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam,
                     H., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint
                     arXiv:1704.04861 (2017).
                [36] “Pytorch Mobile.” https://pytorch.org/mobile/home/. (Accessed: 27 May 2021).
