Copyright of figures and other materials in the paper belongs to the original authors.

AdaBins:
Depth Estimation using Adaptive Bins

Shariq Farooq Bhat (KAUST) et al. | CVPR 2021
Presented by JIN HONGYU
 2021.06.10

 Computer Graphics @ Korea University
Outline

 • Introduction
 • Related Works
 • Methodology
 • Experiments
 • Conclusion

Introduction
Introduction

 • Depth Estimation using Adaptive Bins
  ▪ High-quality dense depth map from a single RGB input image
  ▪ Starts from an encoder-decoder CNN architecture
  ▪ Proposes a transformer-based architecture block
   • Divides the depth range into adaptive bins

Introduction
 Motivation – A Conjecture
 • Conjecture:
  ▪ Current architectures do not perform enough global analysis of the output values
  ▪ Convolutional layers only process global information once the tensors reach a very low spatial resolution, at or near the bottleneck

 Figure from "TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation", Vladimir Iglovikov (Lyft Inc.) and Alexey Shvets (MIT) | arXiv:1801.05746, 2018

Introduction
 Motivation – Global Processing
 • Global processing should be much more powerful when done at high resolution

 • General idea:
  ▪ Perform a global statistical analysis of the output of a traditional encoder-decoder architecture
  ▪ Refine the output with a learned post-processing building block that operates at the highest resolution

Introduction
 Depth Distribution
 • The depth distribution corresponding to different RGB inputs can vary to a large extent
  ▪ This makes end-to-end depth regression an even more difficult task
 • Approach: adaptive focus
  ▪ Let the network learn to adaptively focus on regions of the depth range that are more probable to occur in the scene of the input image

 Figure 1 (part)
Introduction
 Contribution
 • Propose an architecture building block that performs global processing of the scene's information
  ▪ Divide the predicted depth range into bins whose widths change per image
  ▪ The final depth estimation is a linear combination of the bin center values

 • Decisive improvement for supervised single-image depth estimation across all metrics
  ▪ On the two most popular datasets: NYU and KITTI

 • Analyze and investigate different modifications of the proposed AdaBins block
  ▪ Study their effect on the accuracy of the depth estimation
Related Works
Related Works
 Monocular Depth Estimation
 • "From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation" (BTS)
  ▪ Jin Han Lee (Hanyang University) et al. | arXiv:1907.10326, 2019
 • "Guiding Monocular Depth Estimation Using Depth-Attention Volume" (DAV)
  ▪ Lam Huynh (University of Oulu) et al. | ECCV 2020

 Figures: the BTS and DAV architectures
Related Works
 Encoder-Decoder
 • Used in many vision-related problems
  ▪ Image segmentation, optical flow estimation, image restoration
  ▪ Has shown great success in both the supervised and the unsupervised setting of the depth estimation problem

 • This paper adapts the baseline encoder-decoder network architecture of DenseDepth
  ▪ "High Quality Monocular Depth Estimation via Transfer Learning"
   • Ibraheem Alhashim and Peter Wonka (KAUST) | arXiv:1812.11941, 2018

Related Works
 Transformer
 • Traditionally used in Natural Language Processing (NLP)
  ▪ "Attention Is All You Need"
   • Ashish Vaswani (Google Brain) et al. | NIPS 2017
 • Recently also used in computer vision tasks
  ▪ "End-to-End Object Detection with Transformers"
   • Nicolas Carion et al. | ECCV 2020

 Figures: the Transformer in "Attention Is All You Need" and in "End-to-End Object Detection with Transformers"
Methodology
Methodology
 Motivation
 • Performance improvement by transforming the depth regression task into a classification task
  ▪ "Deep Ordinal Regression Network for Monocular Depth Estimation"
   • Huan Fu (The University of Sydney) et al. | CVPR 2018
  ▪ Divides the depth range into a fixed number of bins of predetermined width
 • Several limitations of that work are addressed in this paper by:
  ▪ Computing adaptive bins
   • They dynamically change depending on the input features
  ▪ Predicting the final depth values as a linear combination of bin centers
   • Combines the advantages of classification with those of depth-map regression
  ▪ Computing information globally at a high resolution
   • Compared to other architectures such as DAV
Methodology
 AdaBins Design – Discretized Depth
 • Discretize the depth interval D = (d_min, d_max) into N bins
  ▪ The bin widths b are adaptively computed for each image
   • Better than a fixed bin width or a trained-but-fixed bin width
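 As a minimal sketch (not the authors' code; d_min and d_max are assumed NYU-style values), normalized per-image bin widths can be mapped to absolute bin centers over D like this:

```python
import torch

def bin_centers(b, d_min=1e-3, d_max=10.0):
    """Map normalized bin widths b of shape (B, N), which sum to 1 per image,
    to absolute bin centers over the depth interval (d_min, d_max)."""
    widths = (d_max - d_min) * b                  # absolute bin widths, (B, N)
    right_edges = d_min + torch.cumsum(widths, dim=1)
    return right_edges - 0.5 * widths             # one center per bin

b = torch.softmax(torch.randn(2, 256), dim=1)     # stand-in for adaptive widths
c = bin_centers(b)                                # (2, 256), increasing per image
```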

Methodology
 AdaBins Design – Linear Combination
 • Discretizing the depth interval causes depth discretization artifacts
 • Predict the final depth as a linear combination of bin centers
  ▪ Enables the model to estimate smoothly varying depth values

Methodology
 AdaBins Design – Attention Block
 • Perform global processing using an attention block
  ▪ In this paper:
   • Applied at high resolution
   • Encoder -> Decoder -> Attention
  ▪ In other architectures:
   • Applied at low resolution
   • Encoder -> Attention -> Decoder

 Figure from "Guiding Monocular Depth Estimation Using Depth-Attention Volume"
Methodology
 AdaBins Design – Encoder-Decoder
 • Build on the simplest possible architecture
  ▪ To isolate the effects of the proposed AdaBins concept
  ▪ Use a modern encoder-decoder with EfficientNet-B5 as the encoder backbone
   • "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks"
    ▪ Mingxing Tan and Quoc Le | ICML 2019

 Figure: EfficientNet-B5 architecture (image from Vardan Agarwal's "Complete Architectural Details of all EfficientNet Models")

Methodology
 Architecture – Overview
 • The architecture consists of two major components:
  ▪ An encoder-decoder block
   • Encoder: pretrained EfficientNet-B5
   • Decoder: a standard feature up-sampling decoder with 4 up-sampling layers
  ▪ AdaBins: the adaptive bin-width estimator block

Methodology
 Architecture – Encoder-Decoder
 • Primarily based on a simple depth regression network
  ▪ "High Quality Monocular Depth Estimation via Transfer Learning"
   • Ibraheem Alhashim and Peter Wonka (KAUST) | arXiv:1812.11941
  ▪ Modifications:
   • Encoder: DenseNet -> EfficientNet-B5
   • A different, appropriate loss function
   • Decoder output: final depth map (h × w × 1) -> decoded features (h × w × C_d)

 Figure: the modified decoder outputs decoded features of size h × w × C_d
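 A minimal sketch of such an encoder-decoder, assuming the timm library for the EfficientNet-B5 backbone; the decoder below is illustrative, not the authors' exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm  # assumption: timm provides the pretrained EfficientNet-B5

class EncoderDecoder(nn.Module):
    """EfficientNet-B5 features in, h x w x C_d decoded features out
    (h = H/2, w = W/2 with the strides used here)."""
    def __init__(self, c_d=128):
        super().__init__()
        self.encoder = timm.create_model(
            'efficientnet_b5', pretrained=True, features_only=True)
        chs = self.encoder.feature_info.channels()      # channels per stage
        self.up = nn.ModuleList()
        c_in = chs[-1]
        for skip_c in reversed(chs[:-1]):               # 4 up-sampling layers
            self.up.append(nn.Sequential(
                nn.Conv2d(c_in + skip_c, skip_c, 3, padding=1), nn.LeakyReLU()))
            c_in = skip_c
        self.out = nn.Conv2d(c_in, c_d, 3, padding=1)   # decoded features, not depth

    def forward(self, x):
        feats = self.encoder(x)
        y = feats[-1]
        for up, skip in zip(self.up, reversed(feats[:-1])):
            y = F.interpolate(y, size=skip.shape[-2:], mode='bilinear',
                              align_corners=True)
            y = up(torch.cat([y, skip], dim=1))
        return self.out(y)                              # (B, C_d, H/2, W/2)
```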

Methodology
 Architecture – AdaBins
 • AdaBins module: the key contribution of this paper
  ▪ Input: decoded features (h × w × C_d)
  ▪ Output: depth map (h × w)
  ▪ Due to hardware limitations, h = H/2 and w = W/2
   • Facilitates better learning with larger batch sizes
   • The final depth map is simply bilinearly up-sampled to H × W × 1
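 In code, this final up-sampling is just a bilinear interpolation (a sketch; H and W are illustrative NYU dimensions):

```python
import torch
import torch.nn.functional as F

H, W = 480, 640                                 # full input resolution
depth_half = torch.rand(1, 1, H // 2, W // 2)   # AdaBins output at h = H/2, w = W/2
depth_full = F.interpolate(depth_half, size=(H, W),
                           mode='bilinear', align_corners=True)  # (1, 1, H, W)
```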

 Figure: the AdaBins module takes the decoded features (h × w × C_d) as input

Methodology
 AdaBins – Mini-ViT
 • Estimating sub-intervals within the depth range D requires:
  ▪ Local structural information
  ▪ Global distributional information
 • A global attention method fits, but is usually expensive in memory and complexity
  ▪ Especially at higher resolutions
 • Mini-ViT: a more efficient alternative

Methodology
 AdaBins – Vision Transformer
 • Vision Transformer (ViT):
  ▪ "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
   • Alexey Dosovitskiy (Google Brain) et al. | ICLR 2021
  ▪ Applies a standard NLP Transformer directly to images
   • Split an image into patches
   • Use a sequence of linear embeddings of the patches as input
   • Patches are treated the same way as tokens (words) in NLP

 Figures: the Vision Transformer from "An Image is Worth 16x16 Words" and the Mini-ViT used in this paper
Methodology
 AdaBins – Patch Embedding
 • Patch Embedding:
  ▪ Transform the decoded features into fixed-size patches
   • The Transformer requires fixed-size inputs
  ▪ Pass the decoded features through an embedding convolution
   • Kernel size: p × p
   • Number of output channels: E
   • Thus, the embedding-convolution output size is h/p × w/p × E
  ▪ Reshape into a flattened tensor x_p ∈ ℝ^(S × E)
   • S = hw/p² is the effective sequence length
  ▪ Add learnable 1D positional embeddings
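 A sketch of the patch-embedding step; p = 16 and E = 128 are assumed Mini-ViT settings, and max_len is a hypothetical cap on the sequence length:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """p x p patches of the decoded features become E-dimensional tokens."""
    def __init__(self, c_d=128, p=16, e=128, max_len=500):
        super().__init__()
        self.conv = nn.Conv2d(c_d, e, kernel_size=p, stride=p)  # h/p x w/p x E
        self.pos = nn.Parameter(torch.randn(max_len, e))        # learnable 1D pos. emb.

    def forward(self, x):                    # x: (B, C_d, h, w)
        x = self.conv(x)                     # (B, E, h/p, w/p)
        x = x.flatten(2).transpose(1, 2)     # (B, S, E), S = hw / p^2
        return x + self.pos[:x.size(1)]      # add positional embeddings

tokens = PatchEmbedding()(torch.rand(2, 128, 240, 320))   # (2, 300, 128)
```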

Methodology
 AdaBins – Transformer Encoder
 • Transformer Encoder:
  ▪ Input: patch embeddings
  ▪ Output: output embeddings x_o ∈ ℝ^(S × E)
  ▪ Pass the first row of the output embeddings into an MLP head
   • 3 fully connected layers with leakyReLU between them
   • MLP output: an N-dimensional vector b'
  ▪ Normalize b' into b (Eq. 1): b_i = (b'_i + ε) / Σ_j (b'_j + ε), with ε = 10⁻³
   • b sums to 1, forcing the network to focus on the relevant depth sub-range
   • The ε term ensures each bin width is strictly positive
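 A sketch of this block; the layer count, head count, and hidden width are assumptions, and the ReLU before Eq. 1 is one way to keep b' non-negative:

```python
import torch
import torch.nn as nn

class BinWidthHead(nn.Module):
    """Transformer encoder over the patch tokens; the first output token
    feeds a 3-layer MLP that predicts N normalized bin widths (Eq. 1)."""
    def __init__(self, e=128, n_bins=256, n_layers=4, n_heads=4, eps=1e-3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=e, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlp = nn.Sequential(              # 3 FC layers, leakyReLU between
            nn.Linear(e, 256), nn.LeakyReLU(),
            nn.Linear(256, 256), nn.LeakyReLU(),
            nn.Linear(256, n_bins))
        self.eps = eps

    def forward(self, tokens):                 # tokens: (B, S, E)
        x_o = self.encoder(tokens)             # output embeddings, (B, S, E)
        b_raw = torch.relu(self.mlp(x_o[:, 0])) + self.eps  # b'_i + eps > 0
        b = b_raw / b_raw.sum(dim=1, keepdim=True)          # Eq. 1: sums to 1
        return b, x_o
```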

Methodology
 AdaBins – Range Attention Maps
 • Use part of the output embeddings as 1×1 convolutional kernels
  ▪ The part used: rows 2 through (C+1)
 • Pass the decoded features through a 3×3 convolutional layer, then convolve them with the 1×1 kernels above
  ▪ Equivalent to computing dot products
   • Pixel-wise features are treated as 'keys'
   • Output embeddings act as 'queries'
  ▪ Integrates adaptive global information into the local information
 • The result is the Range Attention Maps ℛ
  ▪ ℛ and b are used to obtain the final depth

 Figure: implementation details of Mini-ViT
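 A sketch of this dot-product view (channel counts are assumed; rows 2..(C+1) of the output embeddings act as a per-image bank of 1×1 kernels):

```python
import torch
import torch.nn as nn

conv3x3 = nn.Conv2d(128, 128, kernel_size=3, padding=1)  # C_d -> E channels

def range_attention_maps(decoded, x_o, n_maps=128):
    """Dot product between pixel-wise 'keys' and embedding 'queries' ==
    convolving the features with the embeddings used as 1x1 kernels."""
    keys = conv3x3(decoded)                    # pixel-wise keys, (B, E, h, w)
    queries = x_o[:, 1:n_maps + 1]             # rows 2..(C+1), (B, C, E)
    B, E, h, w = keys.shape
    R = torch.bmm(queries, keys.flatten(2))    # (B, C, h*w) dot products
    return R.view(B, n_maps, h, w)             # Range Attention Maps

R = range_attention_maps(torch.rand(2, 128, 240, 320), torch.rand(2, 300, 128))
```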

Methodology
 AdaBins – Hybrid Regression
 • Pass the Range Attention Maps through a 1×1 convolutional layer to get N channels
  ▪ N: the number of bins
 • Apply a Softmax activation
  ▪ p_k: the Softmax score of each pixel, k = 1, …, N
  ▪ p_k is interpreted as probabilities over the depth-bin centers c(b) = {c(b_1), c(b_2), …, c(b_N)}
  ▪ Bin centers follow from the bin widths (Eq. 2): c(b_i) = d_min + (d_max − d_min)(b_i/2 + Σ_{j<i} b_j)
 • The final depth value d̃ is a linear combination of p_k and c(b) (Eq. 3): d̃ = Σ_k c(b_k) · p_k
  ▪ This avoids discretization artifacts
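 A sketch of the hybrid regression, reusing the bin_centers idea from the earlier sketch (shapes and channel counts assumed):

```python
import torch
import torch.nn as nn

to_bins = nn.Conv2d(128, 256, kernel_size=1)     # R channels -> N = 256 bins

def hybrid_regression(R, centers):
    """Eq. 3: per-pixel Softmax scores p_k over the N bins, then the final
    depth is the probability-weighted sum of the bin centers c(b_k)."""
    p = torch.softmax(to_bins(R), dim=1)         # (B, N, h, w)
    c = centers[:, :, None, None]                # (B, N, 1, 1)
    return (p * c).sum(dim=1, keepdim=True)      # (B, 1, h, w), smooth values

R = torch.rand(2, 128, 240, 320)
centers = torch.sort(torch.rand(2, 256) * 10.0, dim=1).values
depth = hybrid_regression(R, centers)            # smoothly varying depth map
```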

Methodology
 Loss Function
 • Pixel-wise depth loss (Eq. 4): a scaled version of the Scale-Invariant loss (SI)
  ▪ L_pixel = α · √( (1/T) Σ_i g_i² − (λ/T²) (Σ_i g_i)² )
   • g_i = log d̃_i − log d_i, where d_i is the ground truth and d̃_i the prediction
   • T: the number of pixels with a valid ground-truth value
   • λ = 0.85, α = 10
 • Bin-center density loss (Eq. 5): the bi-directional Chamfer loss between the bin centers c(b) and the set of ground-truth depth values
  ▪ "A Point Set Generation Network for 3D Object Reconstruction from a Single Image"
   • Haoqiang Fan (Tsinghua University) et al. | CVPR 2017
 • Total loss (Eq. 6): L_total = L_pixel + β · L_bins
  ▪ β = 0.1
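 A sketch of both losses with the stated constants; the mean reduction in the Chamfer terms and the use of all valid pixels (rather than a subsample) are assumptions:

```python
import torch

def si_loss(pred, gt, lam=0.85, alpha=10.0):
    """Scaled scale-invariant loss (Eq. 4) over pixels with valid ground truth."""
    valid = gt > 0
    g = torch.log(pred[valid]) - torch.log(gt[valid])
    return alpha * torch.sqrt((g ** 2).mean() - lam * g.mean() ** 2)

def bin_center_loss(centers, gt):
    """Bi-directional Chamfer loss (Eq. 5) between predicted bin centers
    and the set of ground-truth depth values of one image."""
    depths = gt[gt > 0].reshape(1, -1)            # ground-truth depth set, (1, T)
    d2 = (centers.reshape(-1, 1) - depths) ** 2   # (N, T) squared distances
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()

pred = torch.rand(1, 1, 8, 8) + 0.1
gt = torch.rand(1, 1, 8, 8)
centers = torch.sort(torch.rand(256) * 10.0).values
total = si_loss(pred, gt) + 0.1 * bin_center_loss(centers, gt)  # Eq. 6, beta = 0.1
```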

Experiments
Experiments
 Datasets
 • NYU Depth v2: indoor scenes
  ▪ Images & depth maps (640 × 480)
  ▪ A 50K-image subset is used for training in this paper
 • KITTI: outdoor scenes (captured from a moving vehicle)
  ▪ Stereo images (1241 × 376) & 3D laser-scanned data (low density)
  ▪ A 26K-image subset is used for training in this paper
 • SUN RGB-D: indoor scenes
  ▪ Data captured by 4 different sensors
  ▪ Not used for training

Experiments
 Evaluation metrics
 • Standard six metrics used in prior work:
  ▪ Average relative error
  ▪ Root mean squared error
  ▪ Average log10 error
  ▪ Threshold accuracy (δ_i), with thresholds 1.25, 1.25², 1.25³
 • 2 more for KITTI:
  ▪ Squared relative difference
  ▪ RMSE log
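 A common formulation of these metrics (assumed, not copied from the paper's evaluation code):

```python
import numpy as np

def eval_metrics(gt, pred):
    """Six standard depth metrics plus the two extra KITTI metrics,
    computed over pixels with valid ground truth."""
    mask = gt > 0
    gt, pred = gt[mask], pred[mask]
    thresh = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [(thresh < 1.25 ** i).mean() for i in (1, 2, 3)]
    abs_rel = (np.abs(gt - pred) / gt).mean()           # average relative error
    rmse = np.sqrt(((gt - pred) ** 2).mean())           # root mean squared error
    log10 = np.abs(np.log10(pred) - np.log10(gt)).mean()
    sq_rel = (((gt - pred) ** 2) / gt).mean()           # KITTI: squared rel. diff.
    rmse_log = np.sqrt(((np.log(pred) - np.log(gt)) ** 2).mean())  # KITTI: RMSE log
    return dict(d1=d1, d2=d2, d3=d3, abs_rel=abs_rel, rmse=rmse,
                log10=log10, sq_rel=sq_rel, rmse_log=rmse_log)
```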

Experiments
 Implementation details
 • Platform: PyTorch
 • Optimizer: AdamW
  ▪ Weight decay: 10⁻²
 • 1-cycle policy for the learning rate with max_lr = 3.5 × 10⁻⁴
  ▪ First 30% of the steps: linear warm-up from max_lr/25 to max_lr
  ▪ Then: cosine annealing down to max_lr/75
 • Total number of epochs: 25
 • Batch size: 16
 • ~20 min per epoch on a single node with 4 NVIDIA V100 32GB GPUs
 • Main model: 78M parameters
  ▪ CNN encoder: 28M
  ▪ CNN decoder: 44M
  ▪ AdaBins module: 5.8M
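 A sketch of this setup with PyTorch's built-in OneCycleLR; note that OneCycleLR applies one anneal strategy to both phases, so the warm-up below is cosine rather than strictly linear (model and steps_per_epoch are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)          # placeholder for the full AdaBins model
steps_per_epoch = 1000           # placeholder for len(train_loader)

optimizer = torch.optim.AdamW(model.parameters(), lr=3.5e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3.5e-4, epochs=25, steps_per_epoch=steps_per_epoch,
    pct_start=0.3,         # first 30% of steps: warm-up
    div_factor=25,         # start at max_lr / 25
    final_div_factor=3,    # end at (max_lr / 25) / 3 = max_lr / 75
)
```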

Experiments
 Results

Experiments
 Ablation Study – AdaBins & Bin Types
 • AdaBins & bin types:

Experiments
 Ablation Study – Number of Bins
 • Number of bins:

Experiments
 Ablation Study – Loss Function
 • Loss Function:

Experiments
 Test on Webcam
 • Although AdaBins is not designed for real-time applications, it is relatively fast compared to many other non-real-time architectures
  ▪ Test hardware: Intel Core i7-7700K CPU, NVIDIA GeForce GTX 1080 GPU

Conclusion
Conclusion

 • We introduced a new architecture block, called AdaBins, for depth estimation from a single RGB image

 • AdaBins leads to a decisive improvement in the state of the art on the two most popular datasets, NYU and KITTI

 • Future work:
  ▪ Investigate whether global processing of information at a high resolution can also improve performance on other tasks
   • Segmentation, normal estimation, and 3D reconstruction from multiple images

Thank You for Your Attention!