Adaptive Illumination based Depth Sensing using Deep Learning

Qiqin Dai¹, Fengqiang Li², Oliver Cossairt², and Aggelos K. Katsaggelos²
¹Geomagical Labs   ²Northwestern University

arXiv:2103.12297v1 [cs.CV] 23 Mar 2021

Abstract

Dense depth map capture is challenging in existing active sparse illumination based depth acquisition techniques, such as LiDAR. Various techniques have been proposed to estimate a dense depth map based on the fusion of the sparse depth map measurement with the RGB image. Recent advances in hardware enable adaptive depth measurements, resulting in further improvement of the dense depth map estimation. In this paper, we study the topic of estimating dense depth from sparse depth samples. The adaptive sparse depth sampling network is jointly trained with a fusion network of an RGB image and sparse depth, to generate optimal adaptive sampling masks. We show that such adaptive sampling masks can generalize well to many RGB and sparse depth fusion algorithms under a variety of sampling rates (as low as 0.0625%). The proposed adaptive sampling method is fully differentiable and flexible to be trained end-to-end with upstream perception algorithms.

1. Introduction

Depth sensing and estimation is important for many applications, such as autonomous driving [35], augmented reality (AR) [20], and indoor perception [18].

Based on the principle of operation, we can roughly divide current depth sensors into two categories: (1) triangulation-based depth sensors (e.g., stereo [26]) and (2) time-of-flight (ToF) based depth sensors (including direct ToF LiDAR sensors and indirect ToF cameras [14]). The depth precision of triangulation-based depth sensors varies directly with the baseline and inversely with the standoff distance. Therefore, large baselines are required to realize higher depth precision at longer standoff distances. Moreover, the geometric arrangement of the sensor components can lead to occlusion problems, which are undesirable in depth sensing. In contrast, the depth precision of ToF-based depth sensors is independent of the standoff distance, and occlusion problems can also be minimized with them. Compared to indirect ToF cameras, LiDAR has a much longer imaging range (e.g., tens of meters) with a high depth precision (e.g., mm), which makes it a competitive depth sensor. LiDAR has been widely used for machine vision applications, such as navigation in self-driving cars and mobile devices (e.g., the LiDAR sensor on the Apple iPhone 12).

Most LiDAR sensors measure one point of the object at a time and rely on raster scanning to generate a full 3D image of the object, which limits their acquisition speed. In order to produce a full 3D image of an object at a reasonable frame rate with a mechanical scanning scheme, LiDAR can only provide a sparse scan in which two adjacent measuring points are far apart. This leads to very limited spatial resolution for the LiDAR sensor. To increase LiDAR's spatial resolution and acquire more structural information of the scene, other high-spatial-resolution imaging modalities, such as RGB, are often fused with LiDAR's depth images [56, 37]. Traditionally, LiDAR sensors perform raster scanning following a regular grid to produce a spatially uniformly spaced depth map. Recently, researchers have explored LiDAR scanning or sampling patterns and co-optimized the sensor's sampling pattern with a fusion pipeline to further increase its performance [4]. Pittaluga et al. [45] implement adaptive scanning patterns for the LiDAR sensor on a real hardware device by using a MEMS mirror. They co-optimize the scanning hardware and the fusion pipeline with an RGB sensor to increase LiDAR's performance.

In this paper, we study the topic of adaptive depth sampling and depth map reconstruction. First, we formulate the pipeline of joint adaptive depth sampling and depth map estimation. Then, we propose a deep learning (DL) based algorithm for adaptive depth sampling. We show that the proposed adaptive depth sampling algorithm can generalize well to many depth estimation algorithms. Finally, we demonstrate state-of-the-art depth estimation accuracy compared to other existing algorithms.

Our contribution is summarized as follows:

• We propose an adaptive depth sensing framework which benefits from active sparse illumination depth sensors, such as LiDAR and sparse dot-pattern structured light sensors.

• We propose a superpixel segmentation based adaptive sampling mask prediction network and experimentally show better sampling performance compared to existing sampling methods.

• We propose a differentiable sampling layer that translates the estimated sampling locations (x, y coordinates) into a binary sampling mask.

• We demonstrate that the proposed adaptive sampling method can generalize well to many depth estimation algorithms without fine tuning, thus establishing the effectiveness of the proposed sampling method.

[Figure 1: an RGB image, the depth predicted from randomly sampled depth (RMSE: 2052.5), the depth predicted from adaptively sampled depth (RMSE: 1146.9), and the depth ground truth.]

Figure 1. LiDAR systems are able to capture accurate sparse depth maps (bottom). By reducing the number of samples, we are able to increase the capture frame rate. The RGB image (top) can be fused with the captured sparse depth data to estimate a dense depth map. We demonstrate that choosing the sampling locations is important to the accuracy of the estimated depth map. Under a 0.25% sampling rate (with respect to the RGB image), using the same depth estimation method [56], the depth map estimated from the adaptively sampled sparse depth (third row) is more accurate than the depth map estimated from random samples (second row).

2. Related Work

In this section, we review work on algorithm-based depth estimation and sampling mask optimization and clarify the relationship to our proposed method.

2.1. Depth Estimation

Given RGB images, early depth prediction methods relied on hand-crafted features and probabilistic graphical models. Karsch et al. [27, 28] estimate the depth based on querying an RGBD image database. A Markov random field model is applied in [51] to regress depth from a set of image features. Recently, deep learning (DL) and convolutional neural networks (CNNs) have been applied to learn the mapping from single RGB images to dense depth maps [12, 11, 32, 15, 46, 2, 33, 57]. These DL based approaches achieve state-of-the-art performance because better features are extracted and better mappings are learned from large-scale datasets [53, 16, 49].

Given sparse depth measurements, traditional image filtering and interpolation techniques [30] can be applied to reconstruct the dense depth map. Hawe [19] and Liu [39] study the sparse depth map completion problem from the compressive sensing perspective. DL techniques have also been applied to the sparse depth completion problem. A sparse depth map can either be fed into conventional CNNs [43] or sparsity invariant CNNs [55]. When the sampling rate is low, the sparse depth map completion task is challenging.

If both RGB images and sparse depth measurements are provided, traditional guided filter approaches [34, 3] can be applied to refine the depth map. Optimization algorithms that promote depth map priors while maintaining fidelity to the observations are proposed in [40, 10, 41]. Various DL based methods have been developed [43, 42, 56, 6, 25, 59, 61, 52]. Most of these DL approaches are trained and tested using random or regular grid sampling masks. Because depth completion is an active research area, we do not want to limit our adaptive sampling method to a specific depth estimation method.

2.2. Sampling Mask Optimization

Irregular sampling [7, 44, 5] is well studied in the computer graphics and image processing literature to achieve good representations of images. Making the sampling distribution adaptive to the signal further improves representation performance. Eldar et al. [13] proposed a farthest point strategy which performs adaptive and progressive sampling of an image. Inspired by the lifting scheme of wavelet generation, several progressive image sampling techniques were proposed [9, 47]. Ramponi et al. [48] applied a measure of the local sample skewness. Lin et al. [36] utilized the generalized Ricci curvature to sample grey scale images as manifolds with density. A kernel construction technique is proposed in [38]. Taimori et al. [54] investigated space-frequency-gradient information of image patches for adaptive sampling.

Specific reconstruction algorithms are needed for each of these irregular or adaptive sampling methods [7, 13, 47, 9, 48, 36, 38, 54] to reconstruct the fully sampled signal.

[Figure 2: pipeline diagram. The RGB image is fed to NetM to produce a sampling mask; the sampled sparse depth and the RGB image are fed to NetE to produce the predicted depth, supervised by the depth ground truth.]

Figure 2. The proposed pipeline contains two submodules, adaptive depth sampling (NetM) and depth reconstruction (NetE). The binary adaptive sampling mask is generated by NetM based on the RGB image. Then, the LiDAR system samples the scene based on this binary sampling mask and generates the sampled sparse depth map. Finally, both the RGB image and the sampled sparse depth map are input to NetE to estimate a dense depth map.

Furthermore, handcrafted features are applied in these sampling methods. Finally, these sampling techniques are all applied to the same modality (RGB or grey scale images). Recently, Dai et al. [8] applied a DL technique to the adaptive sampling problem. The adaptive sampling network is jointly optimized with the image inpainting network. The sampling probability is optimized during training and binarized during testing. Good performance is demonstrated for X-ray fluorescence (XRF) imaging at sampling rates as low as 5%. Kuznetsov et al. [31] predicted adaptive sampling maps jointly with the reconstruction of Monte Carlo (MC) rendered images using DL. A render simulator differentiable with respect to the sampling map was proposed. Huijben et al. [21, 22] proposed a task adaptive compressive sensing pipeline. The sampling mask is trained with respect to a specific task and is fixed during imaging. The Gumbel-max trick [17, 24] is applied to make the sampling layer differentiable.
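For intuition, the following is a minimal sketch of this kind of relaxation, using PyTorch's gumbel_softmax. It is our illustration, not the exact implementation of [21, 22]; the logit-producing tensor shape and the mask layout are assumptions.

```python
import torch
import torch.nn.functional as F

def gumbel_mask(logits, tau=1.0, hard=True):
    """Relax a per-pixel sampling decision with the Gumbel-softmax trick.

    logits: (N, 2, H, W) per-pixel [keep, drop] scores.
    Returns an (N, 1, H, W) soft (or straight-through hard) sampling mask.
    """
    n, _, h, w = logits.shape
    # gumbel_softmax samples over the last dimension, so move classes there.
    flat = logits.permute(0, 2, 3, 1).reshape(-1, 2)
    # hard=True yields a one-hot sample in the forward pass while
    # backpropagating through the soft relaxation (straight-through).
    sample = F.gumbel_softmax(flat, tau=tau, hard=hard)
    keep = sample[:, 0].reshape(n, 1, h, w)
    return keep

logits = torch.randn(1, 2, 240, 960, requires_grad=True)
mask = gumbel_mask(logits, tau=0.5)
print(mask.shape, mask.sum())  # binary mask, differentiable w.r.t. logits
```

Note that such a per-pixel relaxation does not by itself enforce an exact sampling budget, which is one motivation for predicting sampling locations directly, as discussed next.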
All of the above DL based sampling methods predict a per-pixel sampling probability [8, 21, 22] or a sampling count [31]. Good sampling performance has not been demonstrated under extremely low sampling rates (< 1%). Directly enforcing priors on the sampling locations is effective when the sampling rate is low. This requires the adaptive sampling network to predict sampling locations ((x, y) coordinates) directly and the sampling process to be differentiable. For the RGB and sparse depth adaptive sampling task, Wolff et al. [58] use the SLIC superpixel technique [1] to segment the RGB image and sample the depth map at the center of mass of each superpixel. A bilateral filtering based reconstruction algorithm is proposed to reconstruct the depth map. A spatial distribution prior is implicitly enforced by the superpixel segmentation, resulting in good sampling performance under low sampling rates. However, the sampling and reconstruction methods are not optimized jointly, leaving room for improvement. In this paper, we show that jointly training recent DL based superpixel sampling networks [23, 58] and depth estimation networks [7, 13, 47, 9, 48, 36, 38, 54] can adapt the sampling network to depth estimation and obtain improved reconstruction accuracy.

Bergman et al. [4] warp a uniform sampling grid to generate the adaptive sampling mask. The warping vectors are computed utilizing DL based optical flow estimated from the RGB image. A spatial distribution prior is enforced by the initial uniform sampling grid. End-to-end optimization of the sampling and depth estimation networks is performed, and good depth reconstruction is obtained under low sampling rates. In the pipeline of [4], there are 4 sub-networks, 2 for sampling and the other 2 for depth estimation. They are jointly trained, but only the final depth estimation results are demonstrated. The whole pipeline is bulky and expensive. More importantly, it is hard to assess whether the improvement in depth estimation comes from the sampling part or from the depth estimation part of the pipeline. In this paper, we decouple these two parts and study each individual module to better understand their contribution towards the final depth estimate. Finally, a bilinear sampling kernel is applied in [4] to make the optimization of the sampling locations differentiable. In contrast, we propose a novel differentiable relaxation of the sampling procedure and show its advantages over the bilinear sampling kernel.

3. Method

3.1. Problem Formulation

As shown in Figure 2, the input RGB image is denoted by I. The mask generation network NetM produces a binary sampling mask B = NetM(I, c), where c ∈ [0, 1] is the predefined sampling rate. Elements in B equal to 1 correspond to sampling locations and elements equal to 0 to non-sampling locations.

[Figure 3: encoder-decoder diagram. A ResNet18-based encoder (with 1×1 conv and batch-norm) maps the 4-channel 240×960 input down to 512-channel 8×30 features; a decoder of successive upsample layers, a 3×3 conv, and a final bilinear upsampling restores a 1-channel 240×960 depth map.]

Figure 3. Network architecture for the depth estimation network (NetE). The network consists of an encoder and a decoder. The RGB image and the corresponding sparse depth image are concatenated at the input.

Then the LiDAR system samples depth according to B and produces the measured sparse depth map D0. In synthetic experiments, if the ground truth depth map D is given, the measured sparse depth map D0 is obtained according to

D0 = D ⊙ B = D ⊙ NetM(I, c),   (1)

where ⊙ is the element-wise product operation. The reconstructed depth map D̄ is obtained by the depth estimation network NetE, that is,

D̄ = NetE(I, D0) = NetE(I, D ⊙ NetM(I, c)).   (2)

The overall adaptive depth sensing and depth estimation pipeline is shown in Figure 2. End-to-end training can be applied on NetM and NetE jointly. The adaptive depth sampling strategy is learned by NetM, while NetE estimates the final dense depth map. An informative sampling mask is beneficial to depth estimation algorithms in general, not just to NetE. During testing, we can replace the inpainting network NetE with other depth estimation algorithms. The network architectures and training details of NetE and NetM are discussed in the following subsections.
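As a minimal sketch of Equations 1 and 2 in the synthetic setting (assuming NetM and NetE are torch.nn.Module instances with the interfaces defined above; the function and variable names are ours):

```python
import torch

def forward_pipeline(net_m, net_e, rgb, depth_gt, c=0.0025):
    """Forward pass of Eqs. (1) and (2) in the synthetic setting.

    rgb:      (N, 3, H, W) RGB image I
    depth_gt: (N, 1, H, W) dense ground truth depth D
    c:        sampling rate
    """
    mask = net_m(rgb, c)        # binary sampling mask B
    sparse = depth_gt * mask    # D0 = D * B, Eq. (1)
    pred = net_e(rgb, sparse)   # reconstructed depth, Eq. (2)
    return pred, sparse
```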
3.2. Depth Estimation Network NetE

We use the network architecture of [43] for the depth estimation network, as shown in Figure 3. The network is an encoder-decoder pipeline. The encoder takes the concatenated I and D0 as input (4 channels) and encodes them into latent features. The decoder takes the low spatial resolution feature representation and outputs the restored depth map D̄ = NetE(I, D0).

Because the method of [43] is differentiable with respect to D0 (unlike [6]) and its network architecture is standard without customized fusion modules [25, 56, 6], we choose it as NetE and jointly train NetM with it according to Figure 2. We found that the trained NetM can generalize well to other depth estimation methods during testing.
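For illustration, a skeletal PyTorch version of such an encoder-decoder is given below. This is a sketch in the spirit of [43] whose layer sizes follow Figure 3 only loosely, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

class SimpleNetE(nn.Module):
    """Sketch of an RGB + sparse-depth encoder-decoder (after [43])."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18()
        # The encoder takes 4 channels: RGB concatenated with sparse depth.
        resnet.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2,
                                 padding=3, bias=False)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])

        def up(cin, cout):
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode='bilinear',
                            align_corners=False),
                nn.Conv2d(cin, cout, 3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True))

        self.decoder = nn.Sequential(
            up(512, 256), up(256, 128), up(128, 64), up(64, 32),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, rgb, sparse_depth):
        x = torch.cat([rgb, sparse_depth], dim=1)
        out = self.decoder(self.encoder(x))
        # Final bilinear resize back to the input resolution.
        return nn.functional.interpolate(
            out, size=rgb.shape[-2:], mode='bilinear', align_corners=False)
```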

3.3. Sampling Mask Generation Network NetM

Existing irregular sampling techniques [13, 5] and adaptive depth sampling methods [4, 58] explicitly or implicitly make the sampling points evenly distributed spatially. Such a prior is important when the sampling rate is low. Inspired by the SLIC superpixel [1] based adaptive sampling method [58], we propose to utilize recent DL based superpixel networks [60, 23] as NetM. As demonstrated in Figure 2, NetM adapts to the task of depth sampling after being jointly trained with NetE.

Superpixel segmentation with fully convolutional networks (FCN) [60] is one of the DL based superpixel techniques. It predicts the pixel association map Q given an RGB image I. Its encoder-decoder network architecture is shown in Figure 4.

[Figure 4: encoder-decoder diagram mapping an input RGB image to an output association map.]

Figure 4. Superpixel FCN [60]'s encoder-decoder network architecture.

Similarly to the SLIC superpixel method [1], a combined loss that enforces the similarity of pixels inside one superpixel and spatial compactness is applied. Readers can refer to [60] for more details.

Given an RGB image I with spatial dimensions (H, W), under the desired depth sampling rate c, we have Np = H · W pixels and Ns = c · H · W superpixels. The sampled depth location is the weighted mass center of each superpixel. We denote the sets of pixels as P = {P0, ..., PNs−1}, where Pi is the set of pixels associated with superpixel i. Pixel p's CIELAB color property and (x, y) coordinates are denoted by f(p) ∈ R³ and c(p) ∈ R², respectively. The loss function is given by

LSLIC(f, Q) = Σ_{p∈P} ( ‖f(p) − f′(p)‖₂ + m ‖c(p) − c′(p)‖₂ ).   (3)

Here we have

us = Σ_{p∈Ps} f(p) qs(p) / Σ_{p∈Ps} qs(p),   ls = Σ_{p∈Ps} c(p) qs(p) / Σ_{p∈Ps} qs(p),   (4a)

f′(p) = Σ_{s∈Np} us qs(p),   c′(p) = Σ_{s∈Np} ls qs(p),   (4b)

where m is a weight balancing term between the CIELAB color similarity and spatial compactness, Np is the set of superpixels surrounding p, qs(p) is the probability of pixel p being associated with superpixel s and is derived from the association map Q, us ∈ R³ and ls ∈ R² are the color property and location of superpixel s, and f′(p) ∈ R³ and c′(p) ∈ R² are respectively the reconstructed color property and location of pixel p.
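As an illustration of Equation 4a, here is a sketch that computes the superpixel statistics, under the simplifying assumption that the association map Q is stored as a dense (Ns, Np) tensor of soft assignments ([60] actually restricts qs(p) to a small superpixel neighborhood for efficiency):

```python
import torch

def superpixel_centroids(q, feats, coords):
    """Soft superpixel statistics, Eq. (4a).

    q:      (Ns, Np) soft association of each pixel to each superpixel
    feats:  (Np, 3)  CIELAB color f(p) per pixel
    coords: (Np, 2)  (x, y) coordinates c(p) per pixel
    Returns per-superpixel colors u_s (Ns, 3) and locations l_s (Ns, 2);
    the l_s are the weighted mass centers used as depth sampling locations.
    """
    norm = q.sum(dim=1, keepdim=True).clamp(min=1e-8)  # sum_p q_s(p)
    u = (q @ feats) / norm    # u_s = sum_p f(p) q_s(p) / sum_p q_s(p)
    l = (q @ coords) / norm   # l_s = sum_p c(p) q_s(p) / sum_p q_s(p)
    return u, l
```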
[Figure 5: three panels illustrating the sampling approximation over a local window of pixels wi around the location ls.]

Figure 5. Illustration of the sampling approximation. (a) We find a local window W around ls and compute the distances ρi. (b) We represent ls's depth value ds as a linear combination of the depth values in the local window W. (c) During testing, we sample the depth value at the nearest neighbour wn of ls.

3.4. Soft Sampling Approximation

As defined in Equation 4a, we denote the collection of ls, s = 0, ..., Ns − 1, as S. The depth values at the locations S would be measured during the depth sampling. In order to train NetM and NetE jointly, the sampling operation g, which computes the sampled sparse depth map D0 from the depth ground truth D and the sampling locations S, D0 = g(D, S), needs to be differentiable with respect to S. Unfortunately, such a sampling operation g is not differentiable in nature. Bergman et al. [4] apply a bilinear sampling kernel to differentiably correlate S and D0. The computed gradients rely on the 2 × 2 local structure of the ground truth depth map D and are not stable when the sampling locations are sparse. Thus, limited sampling performance is obtained. We propose a soft sampling approximation (SSA) strategy during training. SSA utilizes a larger window size compared to the bilinear kernel and achieves better sampling performance.

As shown in Figure 5, during training, given a sampling location ls ∈ S, we find a local h × w window W around ls. The depth value ds at ls is a weighted average of the depth values in W,

ds = Σ_{i∈NW} ki di,   (5)

where NW includes the indices of all pixels in W, wi is the i-th pixel's location in W, di is the depth value at wi, and the weights ki are computed according to the Euclidean distance ρi between ls and wi, scaled by a temperature parameter t,

ki = exp(−ρi²/t²) / Σ_{j∈NW} exp(−ρj²/t²).   (6)

When the temperature parameter t → 0, the sampled depth value ds is equal to the depth value dn of the nearest pixel wn. When t is large, the soft sampled depth value ds differs from dn. We gradually reduce t during the training process. During testing, we find the nearest neighbor pixel wn of ls and sample the depth value dn at wn.
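A minimal sketch of the SSA training-time relaxation (Eqs. 5 and 6) and its test-time hard counterpart follows; tensor layouts and helper names are ours, and the paper's implementation may differ in detail:

```python
import torch

def soft_sample(depth, loc, window=5, t=1.0):
    """Soft sampling approximation (SSA), Eqs. (5)-(6), for one location.

    depth: (H, W) dense ground truth depth D
    loc:   (2,) continuous sampling location l_s = (x, y)
    Returns d_s, a weighted average over a window x window neighborhood,
    differentiable with respect to loc.
    """
    h, w = depth.shape
    cx, cy = loc[0], loc[1]
    r = window // 2
    # Center the window at the rounded location (indices are detached).
    x0 = int(cx.detach().round().clamp(r, w - 1 - r))
    y0 = int(cy.detach().round().clamp(r, h - 1 - r))
    xs = torch.arange(x0 - r, x0 + r + 1, device=depth.device)
    ys = torch.arange(y0 - r, y0 + r + 1, device=depth.device)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    # Squared Euclidean distance rho_i^2 between l_s and each pixel w_i.
    rho2 = (gx.float() - cx) ** 2 + (gy.float() - cy) ** 2
    k = torch.softmax(-rho2.reshape(-1) / t ** 2, dim=0)   # Eq. (6)
    d = depth[gy.reshape(-1), gx.reshape(-1)]
    return (k * d).sum()                                    # Eq. (5)

def hard_sample(depth, loc):
    """Test time: read the depth at the nearest pixel w_n of l_s."""
    x = int(loc[0].round().clamp(0, depth.shape[1] - 1))
    y = int(loc[1].round().clamp(0, depth.shape[0] - 1))
    return depth[y, x]
```

Because the softmax in Eq. 6 spreads the weights over the whole window, gradients with respect to the sampling location remain informative even when the window is larger than the 2 × 2 support of a bilinear kernel.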
3.5. Training Procedures

Given the training dataset consisting of the aligned RGB images I and ground truth depth maps D, we first train NetE by minimizing the depth loss

Ldepth = ‖D − NetE(I, D0)‖₂,   (7)

where D0 is obtained by applying a random sampling mask on D with sampling rate c.

Then we initialize the superpixel network NetM using the RGB image I. LSLIC is minimized according to Equation 3. The initialized NetM approximates the SLIC superpixel segmentation of the RGB image. If we were to sample the depth value at ls of each superpixel, the sampling pattern would be similar to [58].

Finally, we jointly train NetE and NetM in Figure 2 by minimizing

L = Ldepth + q · LSLIC,   (8)

where q is the weighting term of LSLIC. The SSA trick shown in Figure 5 is applied, and the temperature parameter t gradually decreases during training.

We fix NetE when training NetM. Optimizing NetE and NetM simultaneously would obtain optimal depth reconstruction accuracy [4]. However, similarly to [8], we want to utilize depth estimation methods other than NetE during testing, so we want to make the adaptive depth sampling mask general and applicable to many depth estimation algorithms.
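One third-stage update can be sketched as follows. The interfaces are ours for illustration: net_m(rgb) is assumed to return the sampling locations together with its SLIC loss, and sampler(depth, locs, t) is assumed to render the SSA-sampled sparse depth map (Eqs. 5-6); the numeric values follow the implementation details in Section 4.1.

```python
import torch

def joint_step(net_m, net_e, sampler, optimizer, rgb, depth, t, q=1e-6):
    """One joint training step for NetM with NetE frozen (Eq. 8)."""
    for p in net_e.parameters():          # NetE stays fixed in this stage
        p.requires_grad_(False)
    locs, l_slic = net_m(rgb)             # sampling locations and L_SLIC
    sparse = sampler(depth, locs, t)      # soft sampling approximation
    pred = net_e(rgb, sparse)
    l_depth = ((depth - pred) ** 2).mean()   # Eq. (7)
    loss = l_depth + q * l_slic              # Eq. (8)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

def temperature(epoch, total_epochs):
    """Linear annealing of t from 1.0 down to 0.1 (Section 4.1)."""
    return 1.0 - 0.9 * epoch / max(total_epochs - 1, 1)
```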
4. Experimental Results

4.1. Implementation Details

We use the KITTI depth completion dataset [55] for our experiments. It consists of aligned ground truth depth maps (from a LiDAR sensor) and RGB images. The original KITTI training and validation set split is applied. There are 42949 and 3426 frames in the training and validation sets, respectively. We only use the bottom center 240 × 960 crop of the images because the LiDAR sensor has no measurements in the upper part of the images.

The ground truth depth maps are not dense because they are measured by a Velodyne LiDAR device.

MAE (mm)
                          c = 1%                              c = 0.25%                             c = 0.0625%
           FusionNet  SSNet   NetE   Colorization  FusionNet  SSNet   NetE   Colorization  FusionNet  SSNet    NetE   Colorization
Random     324.8      466.6   425.7  764.6         488.6      654.9   557.1  1390.7        798.5      1021.1   779.4  2517.5
Poisson    324.1      451.8   409.6  711.5         455.8      621.2   537.5  1314.1        736.2      901.0    743.1  2428.4
SPS        297.2      439.9   388.1  654.3         436.9      594.2   507.8  1197.2        713.8      865.2    724.5  2175.9
DAL        295.8      447.4   390.1  683.4         432.5      599.1   504.8  1239.7        694.4      838.4    710.2  2230.7
NetM       285.0      423.1   380.1  656.2         404.3      562.2   477.5  1189.2        634.9      778.0    652.2  2265.8

RMSE (mm)
                          c = 1%                              c = 0.25%                             c = 0.0625%
           FusionNet  SSNet   NetE    Colorization  FusionNet  SSNet   NetE    Colorization  FusionNet  SSNet   NetE    Colorization
Random     1060.0     1221.6  1294.8  1984.4        1476.0     1709.9  1704.3  3087.3        2135.6     2505.6  2262.6  4749.8
Poisson    1010.1     1140.2  1193.3  1844.8        1375.6     1589.1  1596.8  2897.3        2013.1     2256.0  2151.4  4508.6
SPS        1039.1     1124.3  1160.5  1742.9        1360.1     1559.7  1553.1  2718.1        1993.8     2215.0  2141.5  4161.1
DAL        969.9      1115.1  1177.5  1784.3        1336.1     1548.1  1532.1  2772.6        1937.6     2128.3  2085.7  4242.3
NetM       939.4      1074.9  1131.3  1725.9        1239.7     1436.8  1422.4  2584.5        1732.4     1930.5  1896.7  3972.9

Table 1. Depth sampling and estimation results on the KITTI test dataset. Random, Poisson [5], SPS [58], DAL [4] and the proposed NetM sampling strategies are compared utilizing the NetE [43], FusionNet [56], SSNet [42], and Colorization [34] depth estimation algorithms. MAE and RMSE metrics are reported; NetM achieves the best result in every column except MAE with Colorization at c = 1% and c = 0.0625%, where SPS is best. The results shown are averaged over the set of 3426 test frames.

In order to perform adaptive depth sampling, we need dense depth maps to sample from. Similarly to [4], a traditional image inpainting algorithm [34] is applied to densify the depth ground truth. During evaluation, we compare the estimated dense depth maps to the original sparse ground truth depth maps.

During the training of NetE, we follow Ma et al.'s setup [43]. The batch size is set equal to 16. The ResNet encoder in Figure 3 is initialized with pretrained weights from the ImageNet dataset [50]. A stochastic gradient descent (SGD) optimizer with momentum 0.9 is used. We train 100 epochs in total. The learning rate is set equal to 0.01 at first and reduced by 80% every 25 epochs. NetE is trained individually under the different sampling rates c = 1%, 0.25% and 0.0625% using random sampling masks. We also train FusionNet [56] and SSNet [42] under the different sampling rates using random sampling masks. The same training procedures as in their original papers are used. They serve as alternative depth estimation methods.

We test the proposed sampling algorithm under 3 sampling rates, c = 1%, 0.25% and 0.0625%. They correspond to Ns = 2304, 576 and 144 depth samples (superpixels) in the 240 × 960 image. NetM is configured to output the desired number of samples. During the training of NetM, we pretrain it using the SLIC loss. m in Equation 3 is set equal to 1. The ADAM optimizer [29] is applied. The learning rate is set to 5 × 10⁻⁵. We train 100 epochs in total.
After NetM is initialized, we finally jointly train NetM and NetE according to Figure 2. The loss defined in Equation 8 is optimized with q equal to 10⁻⁶, resulting in Ldepth being equal to about 10 times q · LSLIC in value. The window size of the soft depth sampling module is equal to 5. The temperature t defined in Equation 6 decreases from 1.0 to 0.1 linearly during training. The batch size is set equal to 8, which is the largest batch size we can use for both NetM and NetE on an NVIDIA 2080Ti GPU (11GB memory). As discussed in Section 3.5, NetE is fixed during this training to make NetM generalize well to other depth estimation methods. The learning rate of NetM is set equal to 10⁻⁴ and is reduced by 50% every 10 epochs. An SGD optimizer with momentum 0.9 is used. We found that 50 epochs in total are adequate for convergence.

Our proposed adaptive depth sampling framework is implemented in PyTorch and our implementation is available at: https://github.com/usstdqq/adaptive-depth-sensing.
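A configuration sketch of the schedules described above (the optimizer and scheduler choices follow the text; the function and variable names are ours):

```python
import torch

def make_optimizers(net_e, net_m):
    """Optimizers/schedulers matching the training schedules above."""
    # NetE: SGD with momentum 0.9, lr 0.01, reduced by 80% every 25 epochs.
    opt_e = torch.optim.SGD(net_e.parameters(), lr=0.01, momentum=0.9)
    sched_e = torch.optim.lr_scheduler.StepLR(opt_e, step_size=25, gamma=0.2)
    # NetM pretraining on the SLIC loss: ADAM with lr 5e-5.
    opt_m_pre = torch.optim.Adam(net_m.parameters(), lr=5e-5)
    # NetM joint training: SGD with momentum 0.9, lr 1e-4,
    # reduced by 50% every 10 epochs.
    opt_m = torch.optim.SGD(net_m.parameters(), lr=1e-4, momentum=0.9)
    sched_m = torch.optim.lr_scheduler.StepLR(opt_m, step_size=10, gamma=0.5)
    return (opt_e, sched_e), opt_m_pre, (opt_m, sched_m)
```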
4.2. Performance on Adaptive Depth Sensing and Estimation

For the adaptive depth sampling and estimation task, we demonstrate the advantages of our proposed adaptive sampling mask NetM over the use of random and Poisson [5] sampling masks, as well as over other state-of-the-art adaptive depth sampling methods, such as the SuperPixel Sampler (SPS) [58] and Deep Adaptive Lidar (DAL) [4].

The Random, Poisson, SPS [58], DAL [4] and proposed NetM sampling methods are applied to the 3426 test frames from the KITTI validation set. Sampling rates c = 1%, 0.25% and 0.0625% are tested. For the depth estimation methods, the DL based methods NetE [43], FusionNet [56] and SSNet [42], and the traditional method Colorization [34], are used to estimate the fully sampled depth map from the sampled depth map and the RGB image. It is noted that all the DL based depth estimation methods are trained using random sampling masks and the same KITTI training dataset.

[Figure 6 panels, per-panel RMSE (mm):
                FusionNet   SSNet    NetE     Colorization
  Random        1612.2      1913.8   1852.0   3604.3
  Poisson       1617.8      1873.0   1965.0   3260.5
  SPS           1559.9      1902.9   1848.4   3280.7
  DAL           1561.6      1883.8   1839.4   3376.6
  NetM          1248.5      1531.2   1505.1   3028.1]

Figure 6. Visual comparison of the estimated depth maps. Random, Poisson, SPS, DAL, and NetM sampling masks at sampling rate c = 0.25% are applied and shown in the 2nd-6th rows, respectively. The first row includes the RGB image and the ground truth depth map. Sampling locations are indicated using black dots. The FusionNet, SSNet, NetE and Colorization depth estimation methods are used to perform depth estimation and generate the depth maps of the 1st-4th columns, respectively. The RMSE is computed for each depth map with respect to the ground truth depth map.

The average Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) over all 3426 test frames are shown in Table 1. First, under all three sampling rates, the proposed NetM mask outperforms the random, Poisson, SPS and DAL masks consistently over all depth estimation methods in terms of RMSE and MAE. This demonstrates the effectiveness of our proposed adaptive depth sampling network. Furthermore, although NetM is jointly trained with NetE, it still performs well with other depth estimation methods, demonstrating that it can generalize well and that its performance advantage is not tied to any specific depth estimation method. Finally, it can be concluded that the smaller the sampling rate, the larger the advantage of NetM compared to other sampling algorithms. This implies that NetM is able to handle challenging depth sampling tasks (extremely low sampling rates).
 The visual quality comparison of various sampling
strategies and depth estimation methods is shown in Fig-
ure 6. For sampling rate equal to c = 0.25%, we can ure 7, we visualize the superpixel segmentation results and
observe the advantages of the proposed N etM mask over the derived sampling locations for SLIC, FCN and N etM
all other sampling masks by comparing the resulting depth when c = 0.0625%. SLIC and FCN segment the input
maps by the same depth estimation algorithm. N etM sam- RGB image based on the color similarity and preserve spa-
ples densely around the end of the road, trees and billboard, tial compactness. The segmentation density is spatially ho-
resulting in accurate depth estimation in such areas. Com- mogeneous. N etM is jointly trained with N etE, thus it
pared to other adaptive sampling algorithms, such as SPS has knowledge of distance given the RGB input image. Dis-
and DAL, N etM samples more densely on distance ob- tance objects in the image are sampled denser. It also seg-
jects, making the estimated depth more accurate. SPS uses ments sparsely the pavement and grass areas. Such near
SLIC [1] to segment the RGB image and such segmentation objects as cars are segmented denser compared to the pave-
can not obtain distance information from the RGB images. ment and grass areas. We also observe that N etM segmen-
DAL estimates a smooth motion field to warp a regular sam- tation does not preserve color pixel boundaries as well as
pling grid. When the scene is complicated, it is not flexible FCN, which is expected as N etM also minimizes the depth
enough to warp a regular sampling grid to optimal location. estimation loss besides the SLIC loss in Equation 8.
 N etM is initialized with RGB images trained FCN [60]
superpixel network using the SLIC loss (Equation 3). 4.3. Effectiveness of Soft Sampling Approximation
SPS [58] uses the SLIC superpixel technique to segment
the RGB images. Sampling locations are determined by the In Section 3.4, we propose the use of SSA to make the
weighted mass center of superpixels. Different superpixel sampling process differentialble during training. Such dif-
segmentations result in different sampling quality. In Fig- ferentiable sampling approximation is necessary to jointly

[Figure 7: superpixel segmentations of an RGB image produced by SLIC, FCN and NetM.]

Figure 7. Visual comparison of different superpixel segmentations and sampling locations. Segmentation boundaries are plotted in green and sampling locations are plotted in red.

[Figure 8: two test scenes, each showing the RGB image, the depth ground truth, and reconstructions by four pipelines, with per-scene RMSE (mm):
  SPS sampling + SPS reconstruction:            1039.0 / 1170.8
  DAL sampling + DAL reconstruction:            1350.0 / 1239.1
  NetM sampling + FusionNet reconstruction:      871.9 / 1022.5
  NetM* sampling + FusionNet* reconstruction:    847.2 /  912.8]

Figure 8. Visual comparison of different depth sampling and estimation methods.

4.3. Effectiveness of Soft Sampling Approximation

In Section 3.4, we proposed the use of SSA to make the sampling process differentiable during training. Such a differentiable sampling approximation is necessary to jointly train NetM with NetE. Compared to the 2×2 bilinear kernel based differentiable sampling in [4], the proposed SSA provides better sampling performance. In order to show the advantages of SSA, we replace the SSA sampling of NetM with the bilinear kernel based sampling and perform the exact same training procedures. As demonstrated in Table 2, the lower the sampling rate, the bigger the advantage of SSA over the bilinear kernel sampling. When sampling points are sparse, the gradients derived from a 2 × 2 local window are too small to train NetM effectively. We empirically found that a 5 × 5 window size for SSA provides reasonable sampling performance under all sampling rates.
 pling masks during the training of depth estimation meth-
4.4. End To End Depth Estimation Performance ods (FusionNet, SSNet, N etE) makes the methods robust
 to other sampling patterns in testing. We also found that
 In Table 1, FusionNet [56] achieves the best depth esti- N etM trained using different depth estimation methods has
mation performance under various of sampling masks. The similar sampling patterns. So simultaneously training the
proposed global and location information fusion is effective sampling and reconstruction networks improves the results
and the network size is considerably larger than N etE [43]. slightly.
Best depth sampling and estimation results are obtained us- In Figure 8, we visually compare the end to end depth
ing N etM sampling and FusionNet depth estimation under sampling and reconstruction results. In the 2 test scenes,
all sampling rates. It is noted that N etM is trained jointly N etM ∗ with FusionNet* properly sample and reconstruct
 distant and thin objects, resulting in the best accuracy com-
 pared to other methods. With the developing depth esti-
 c Sampling Reconstruction MAE RMSE
 SPS SPS 406.3 1264.2 mation algorithms, we can integrate better depth estimation
 1%
 DAL DAL 550.3 1566.7 methods into our system. We show in Section 4.2 that the
 N etM FusionNet 285.0 939.4
 N etM ∗ FusionNet* 284.6 932.6 performance advantages of N etM can generalize well to
 SPS SPS 812.7 2192.3 other than N etE depth estimation methods.
 DAL DAL 597.7 1667.8
 0.25%
 N etM FusionNet 404.3 1239.7
 N etM ∗ FusionNet* 402.9 1229.4 5. Conclusion
 SPS SPS 1668.6 3891.9
 DAL DAL 789.1 2104.0
 0.0625%
 N etM FusionNet 634.9 1732.3
 In this paper, we presented a novel adaptive depth sam-
 N etM ∗ FusionNet* 631.5 1721.1 pling algorithm based on DL. The mask generation network
 N etM is trained along with the depth completion network
 Table 3. End to end depth estimation results comparison. N etE to predict the optimal sampling locations based on

In this paper, we presented a novel adaptive depth sampling algorithm based on DL. The mask generation network NetM is trained along with the depth completion network NetE to predict the optimal sampling locations based on an input RGB image. Experiments demonstrate the effectiveness of the proposed NetM. Higher depth estimation accuracy is achieved with NetM under various depth completion algorithms. We also show that the best end to end performance is achieved by NetM with a state-of-the-art depth completion algorithm. Such an adaptive depth sampling strategy enables more efficient depth sensing and overcomes the trade-off between frame rate, resolution, and range in an active depth sensing system (such as LiDAR and sparse dot-pattern structured light sensors).

References

[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. Slic superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.

[2] Ibraheem Alhashim and Peter Wonka. High quality monocular depth estimation via transfer learning. arXiv preprint arXiv:1812.11941, 2018.

[3] Jonathan T Barron and Ben Poole. The fast bilateral solver. In European Conference on Computer Vision, pages 617–632. Springer, 2016.

[4] Alexander W Bergman, David B Lindell, and Gordon Wetzstein. Deep adaptive lidar: End-to-end optimization of sampling and depth completion at low sampling rates. In 2020 IEEE International Conference on Computational Photography (ICCP), pages 1–11. IEEE, 2020.

[5] Robert Bridson. Fast poisson disk sampling in arbitrary dimensions. SIGGRAPH Sketches, 10:1, 2007.

[6] Zhao Chen, Vijay Badrinarayanan, Gilad Drozdov, and Andrew Rabinovich. Estimating depth from rgb and sparse sensing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 167–182, 2018.

[7] Robert L Cook. Stochastic sampling in computer graphics. ACM Transactions on Graphics (TOG), 5(1):51–72, 1986.

[8] Qiqin Dai, Henry Chopp, Emeline Pouyet, Oliver Cossairt, Marc Walton, and Aggelos Katsaggelos. Adaptive image sampling using deep learning and its application on x-ray fluorescence image reconstruction. IEEE Transactions on Multimedia, 22(10):2564–2578, 2020.

[9] Laurent Demaret, Nira Dyn, and Armin Iske. Image compression by linear splines over adaptive triangulations. Signal Processing, 86(7):1604–1616, 2006.

[10] Gilad Drozdov, Yevgengy Shapiro, and Guy Gilboa. Robust recovery of heavily degraded depth measurements. In 2016 Fourth International Conference on 3D Vision (3DV), pages 56–65. IEEE, 2016.

[11] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.

[12] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.

[13] Yuval Eldar, Michael Lindenbaum, Moshe Porat, and Yehoshua Y Zeevi. The farthest point strategy for progressive image sampling. IEEE Transactions on Image Processing, 6(9):1305–1315, 1997.

[14] Sergi Foix, Guillem Alenya, and Carme Torras. Lock-in time-of-flight (tof) cameras: A survey. IEEE Sensors Journal, 11(9), 2011.

[15] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.

[16] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

[17] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications. NBS Applied Mathematics Series, 33, 1954.

[18] Jungong Han, Ling Shao, Dong Xu, and Jamie Shotton. Enhanced computer vision with microsoft kinect sensor: A review. IEEE Transactions on Cybernetics, 43(5):1318–1334, 2013.

[19] Simon Hawe, Martin Kleinsteuber, and Klaus Diepold. Dense disparity maps from sparse disparity measurements. In 2011 International Conference on Computer Vision, pages 2126–2133. IEEE, 2011.

[20] HoloLens (Microsoft), 2020. Retrieved from https://www.microsoft.com/en-us/hololens.

[21] Iris AM Huijben, Bastiaan S Veeling, and Ruud JG van Sloun. Deep probabilistic subsampling for task-adaptive compressed sensing. In International Conference on Learning Representations, 2019.

[22] Iris AM Huijben, Bastiaan S Veeling, and Ruud JG van Sloun. Learning sampling and model-based signal recovery for compressed sensing mri. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8906–8910. IEEE, 2020.

[23] Varun Jampani, Deqing Sun, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Superpixel sampling networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 352–368, 2018.

[24] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

[25] Maximilian Jaritz, Raoul De Charette, Emilie Wirbel, Xavier Perrotton, and Fawzi Nashashibi. Sparse and dense data with cnns: Depth completion and semantic segmentation. In 2018 International Conference on 3D Vision (3DV), pages 52–60. IEEE, 2018.

[26] Takeo Kanade, Atsushi Yoshida, Kazuo Oda, Hiroshi Kano, and Masaya Tanaka. A stereo machine for video-rate dense depth mapping and its new applications. In Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 196–202. IEEE, 1996.
[27] Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2144–2158, 2014.

[28] Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth transfer: Depth extraction from videos using nonparametric sampling. In Dense Image Correspondences for Computer Vision, pages 173–205. Springer, 2016.

[29] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[30] Jason Ku, Ali Harakeh, and Steven L Waslander. In defense of classical image processing: Fast depth completion on the cpu. In 2018 15th Conference on Computer and Robot Vision (CRV), pages 16–22. IEEE, 2018.

[31] Alexandr Kuznetsov, Nima Khademi Kalantari, and Ravi Ramamoorthi. Deep adaptive sampling for low sample count rendering. In Computer Graphics Forum, volume 37, pages 35–44. Wiley Online Library, 2018.

[32] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth International Conference on 3D Vision (3DV), pages 239–248. IEEE, 2016.

[33] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019.

[34] Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. In ACM SIGGRAPH 2004 Papers, pages 689–694. 2004.

[35] J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J. Z. Kolter, D. Langer, O. Pink, V. Pratt, M. Sokolsky, G. Stanek, D. Stavens, A. Teichman, M. Werling, and S. Thrun. Towards fully autonomous driving: Systems and algorithms. In 2011 IEEE Intelligent Vehicles Symposium (IV), pages 163–168, June 2011.

[36] Shu Lin, Zhongxuan Luo, Jielin Zhang, and Emil Saucan. Generalized ricci curvature based sampling and reconstruction of images. In 2015 23rd European Signal Processing Conference (EUSIPCO), pages 604–608. IEEE, 2015.

[37] David B Lindell, Matthew O'Toole, and Gordon Wetzstein. Single-photon 3d imaging with deep sensor fusion. ACM Transactions on Graphics (TOG), 37(4):1–12, 2018.

[38] Jianxiong Liu, Christos Bouganis, and Peter YK Cheung. Kernel-based adaptive image sampling. In 2014 International Conference on Computer Vision Theory and Applications (VISAPP), volume 1, pages 25–32. IEEE, 2014.

[39] Lee-Kang Liu, Stanley H Chan, and Truong Q Nguyen. Depth reconstruction from sparse samples: Representation, algorithm, and sampling. IEEE Transactions on Image Processing, 24(6):1983–1996, 2015.

[40] Jiajun Lu and David Forsyth. Sparse depth super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2245–2253, 2015.

[41] Fangchang Ma, Luca Carlone, Ulas Ayaz, and Sertac Karaman. Sparse depth sensing for resource-constrained robots. The International Journal of Robotics Research, 38(8):935–980, 2019.

[42] Fangchang Ma, Guilherme Venturelli Cavalheiro, and Sertac Karaman. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera. In 2019 International Conference on Robotics and Automation (ICRA), pages 3288–3295. IEEE, 2019.

[43] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.

[44] Roberta Piroddi and Maria Petrou. Analysis of irregularly sampled data: A review. Advances in Imaging and Electron Physics, 132:109–167, 2004.

[45] Francesco Pittaluga, Zaid Tasneem, Justin Folden, Brevin Tilmon, Ayan Chakrabarti, and Sanjeev J Koppal. Towards a mems-based adaptive lidar. arXiv preprint arXiv:2003.09545, 2020.

[46] Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, Raquel Urtasun, and Jiaya Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 283–291, 2018.

[47] Siddavatam Rajesh, K Sandeep, and RK Mittal. A fast progressive image sampling using lifting scheme and non-uniform b-splines. In 2007 IEEE International Symposium on Industrial Electronics, pages 1645–1650. IEEE, 2007.

[48] Giovanni Ramponi and Sergio Carrato. An adaptive irregular sampling algorithm and its application to image coding. Image and Vision Computing, 19(7):451–460, 2001.

[49] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3234–3243, 2016.

[50] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[51] Ashutosh Saxena, Sung Chung, and Andrew Ng. Learning depth from single monocular images. Advances in Neural Information Processing Systems, 18:1161–1168, 2005.

[52] Shreyas S Shivakumar, Ty Nguyen, Ian D Miller, Steven W Chen, Vijay Kumar, and Camillo J Taylor. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 13–20. IEEE, 2019.

[53] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.

[54] Ali Taimori and Farokh Marvasti. Adaptive sparse image sampling and recovery. IEEE Transactions on Computational Imaging, 4(3):311–325, 2018.

[55] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant cnns. In 2017 International Conference on 3D Vision (3DV), pages 11–20. IEEE, 2017.

[56] Wouter Van Gansbeke, Davy Neven, Bert De Brabandere, and Luc Van Gool. Sparse and noisy lidar completion with rgb guidance and uncertainty. In 2019 16th International Conference on Machine Vision Applications (MVA), pages 1–6. IEEE, 2019.

[57] Diana Wofk, Fangchang Ma, Tien-Ju Yang, Sertac Karaman, and Vivienne Sze. Fastdepth: Fast monocular depth estimation on embedded systems. In 2019 International Conference on Robotics and Automation (ICRA), pages 6101–6108. IEEE, 2019.

[58] Adam Wolff, Shachar Praisler, Ilya Tcenov, and Guy Gilboa. Super-pixel sampler: A data-driven approach for depth sampling and reconstruction. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 2588–2594. IEEE, 2020.

[59] Alex Wong, Xiaohan Fei, Stephanie Tsuei, and Stefano Soatto. Unsupervised depth completion from visual inertial odometry. IEEE Robotics and Automation Letters, 5(2):1899–1906, 2020.

[60] Fengting Yang, Qian Sun, Hailin Jin, and Zihan Zhou. Superpixel segmentation with fully convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13964–13973, 2020.

[61] Yanchao Yang, Alex Wong, and Stefano Soatto. Dense depth posterior (ddp) from single image and sparse range. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3353–3362, 2019.
