Neural Architecture Search for Efficient Uncalibrated Deep Photometric Stereo


Francesco Sarno¹, Suryansh Kumar¹, Berk Kaya¹, Zhiwu Huang¹, Vittorio Ferrari², Luc Van Gool¹,³
¹Computer Vision Lab, ETH Zürich, ²Google Research, ³KU Leuven

arXiv:2110.05621v1 [cs.CV] 11 Oct 2021

Abstract

We present an automated machine learning approach for uncalibrated photometric stereo (PS). Our work aims at discovering lightweight and computationally efficient PS neural networks with excellent surface normal accuracy. Unlike previous uncalibrated deep PS networks, which are handcrafted and carefully tuned, we leverage a differentiable neural architecture search (NAS) strategy to find an uncalibrated PS architecture automatically. We begin by defining a discrete search space for a light calibration network and a normal estimation network, respectively. We then perform a continuous relaxation of this search space and present a gradient-based optimization strategy to find an efficient light calibration and normal estimation network. Directly applying the NAS methodology to uncalibrated PS is not straightforward as certain task-specific constraints must be satisfied, which we impose explicitly. Moreover, we search for and train the two networks separately to account for the Generalized Bas-Relief (GBR) ambiguity. Extensive experiments on the DiLiGenT dataset show that the performance of the automatically searched neural architectures compares favorably with the state-of-the-art uncalibrated PS methods while having a lower memory footprint.

1. Introduction

Photometric stereo (PS) aims at recovering an object’s surface normals from its light-varying images captured from a fixed viewpoint. Although range scanning methods [45, 35, 46, 44], multi-view methods [15, 32, 33, 29, 30, 31] and single image dense depth estimation methods [54, 13, 34] can recover the object’s surface normals, photometric stereo is excellent at capturing high-frequency surface details such as scratches, cracks, and dents from images. Therefore, it is a favored choice for fine-detailed surface recovery in many scientific and engineering areas such as forensics [53] and molding [70].

Seminal work on PS assumes a Lambertian object under a calibrated setting, i.e., the directions of the light sources are known [65]. Firstly, the Lambertian object assumption does not hold for surfaces with general reflectance properties. As a result, several robust methods [67, 47, 23] and realistic Bidirectional Reflectance Distribution Function (BRDF) based methods [16, 11, 17, 21, 58] were proposed. Robust methods treat non-Lambertian effects as outliers, and popular realistic BRDF models are confined to isotropic BRDF modeling of non-Lambertian surfaces [22, 58]. Hence, these methods can only model the reflectance properties of a restricted class of materials. In general, modeling surfaces with unknown reflectance properties is challenging.

In recent years, deep neural networks have significantly improved the performance of many computer vision tasks, including photometric stereo. Their powerful ability to learn from data has helped in modeling surfaces with unknown reflectance properties, which was a challenge for traditional PS methods. Further, neural networks can implicitly learn the image formation process and global illumination effects from data, which classical algorithms cannot do. As a result, several deep learning architectures were proposed for PS [20, 63, 8, 7, 41, 40, 19]. Hence, by leveraging a deep neural network, we can overcome the shortcoming of PS due to the Lambertian object assumption. However, these methods still rely on the other assumption of a calibrated setting, i.e., the light source directions are given at test time, limiting their practical application. Accordingly, uncalibrated deep PS methods that can provide results comparable to calibrated PS networks are becoming more and more popular [6, 27, 9].

Despite their impressive results, deep uncalibrated PS methods have a few critical issues: the network architecture is manually designed, and therefore, such networks are typically not optimally efficient and have a large memory footprint [27, 6, 8, 9]. Moreover, the authors of such networks conduct many experiments to explore the effect of empirically selected operations and to tune hyperparameters. However, prior research in machine learning shows that not only the type of operation but sometimes also its placement (ordering) matters for performance [18, 73]. Therefore, a separate line of research known as Neural Architecture Search (NAS) has gained tremendous interest to tackle such challenges in architecture design. NAS methods automate the design process, greatly reducing human effort in searching for an efficient network design [1]. NAS algorithms have shown great success in many high-level computer vision tasks such as object detection [61, 72], image classification [66], image super-resolution [68], action recognition [60], and semantic segmentation [37]. Yet, their potential for low-level 3D computer vision problems such as uncalibrated PS remains unexplored.

Among architecture search methods, evolutionary algorithms [51, 52] and reinforcement learning-based methods [74, 75] are computationally expensive and need thousands of GPU hours to find an architecture. Hence, they are not suitable for our problem. Instead, we adhere to the cell-based differentiable NAS formulation. It has proven itself to be computationally efficient and demonstrated encouraging results for many high-level vision problems [39, 36]. However, in those applications, differentiable NAS is used without any task-specific treatment. Unfortunately, this will not work for the uncalibrated PS problem. There exists GBR ambiguity [4] due to the lack of light source information. Moreover, certain task-specific constraints must be satisfied (e.g., unit normal, unit light source direction), and the method must operate on unordered image sets. Unlike typical NAS-based methods, we incorporate human knowledge in our search strategy to address those challenges. To resolve GBR ambiguity, we first search for an efficient light calibration network, followed by a normal estimation network’s search [6]. To handle PS-related constraints, we fix some network layers and define our discrete search space for both networks accordingly. We model our PS architecture search space via a continuous relaxation of the discrete search space, which can be optimized efficiently using a gradient-based algorithm.

We evaluated our method’s performance on the DiLiGenT benchmark PS dataset [59]. The experiments reveal that our approach discovers lightweight architectures, which provide results comparable to the state-of-the-art manually designed deep uncalibrated networks [8, 6, 27].

This paper makes the following contributions:
• We propose the first differentiable NAS-based framework to solve the uncalibrated photometric stereo problem.
• Our architecture search methodology considers the task-specific constraints of photometric stereo during search, train, and test time to discover meaningful architectures.
• We show that the automatically designed architecture outperforms the existing traditional uncalibrated PS methods and compares favorably against hand-crafted deep PS networks with significantly fewer parameters.

2. Proposed Method

This section describes our task-specific neural architecture search (NAS) approach. We utilize the seminal classical photometric stereo formulation [65, 4] and a previous handcrafted deep neural network design [6] as the basis of our NAS framework. Utilizing the knowledge of previous methods in the architecture design process not only helps in reducing the architecture’s search time but also provides an optimal architecture with better accuracy [6, 27]. Before we describe the NAS modeling of our problem, we define the classical photometric stereo setup.

Consider an orthographic camera observing a rigid object from a given viewpoint $v = (0, 0, 1)^T$. For the PS setup, the images are captured by firing one unique directional light source per image. Let $I \in \mathbb{R}^{m \times n}$ be the measurement matrix comprising $n$ images with $m$ pixels stacked as column vectors. Let $L \in \mathbb{R}^{3 \times n}$ and $N \in \mathbb{R}^{3 \times m}$ denote all the light sources and surface normals, respectively. Then, the image formation model under the Lambertian surface assumption is formulated as follows:

$$ I = \rho \cdot N^T L + E. \tag{1} $$

Here, $\rho \in \mathbb{R}$ is the diffuse albedo and $E$ accounts for error due to shadows, specularities, or noise. When all non-Lambertian effects are ignored, solving Eq.(1) can recover the actual surface up to a GBR transformation, such that $I = (G^{-T} \bar{N})^T (G L)$. Here, $\bar{N} \in \mathbb{R}^{3 \times m}$ denotes the albedo-scaled normals and $G \in \mathbb{R}^{3 \times 3}$ is the transformation matrix with 3 unknown parameters [4, 5]. This indicates that there are many solutions leading to the same image. Nevertheless, it is well known that specularities [16, 12], interreflections [5], albedo distributions [3, 56] and BRDF properties [62, 69, 42] provide useful cues for disambiguation. However, such cues are not well exploited in a single-stage network designed for regressing per-pixel normals, and therefore, we use two different neural networks following Chen et al. [6]. We first learn the light sources from images by training a light calibration network in a supervised way. Then, we use its results at inference time for the normal estimation network to predict the surface normals. Unlike other uncalibrated deep PS methods, our approach allows automatic search for the optimal architecture for both the light calibration and normal estimation networks.

2.1. Architecture Search for Uncalibrated PS

Leveraging the recent one-shot cell-based NAS method, i.e., DARTS [39], we first define different discrete search spaces for the light calibration and normal estimation networks. Next, we perform a continuous relaxation of these search spaces, leading to differentiable bi-level objectives for optimization. We perform an end-to-end architecture search for the light calibration and normal estimation networks separately to obtain optimal architectures. Contrary to high-level vision problems such as object detection, image classification, and others [39, 10, 75], directly applying one-shot NAS to existing uncalibrated PS networks [6, 27] may not necessarily lead to a good solution. Unfortunately, for our task, a single end-to-end NAS seems challenging. It may lead

to unstable behavior due to GBR ambiguity [4]. Therefore, we search for an optimal light calibration network first and then search for a normal estimation network by keeping some of the necessary operations or layers fixed; such a strategy is used in other NAS-based applications [14]. The searched architectures are then trained independently for inference.

Figure 1: Illustration of a cell. (a) Initially, the optimal operations $\bar{o}^{(i,j)}$ between nodes $x^{(i)}$ and $x^{(j)}$ are unknown. (b) Each node is computed by a mixture of candidate operations. (c) Architecture encoding is obtained by solving the continuous relaxation of the search space. (d) Optimal cell obtained after selection of the most probable candidate operation.

• Background on Differentiable NAS. In recent years, Neural Architecture Search (NAS) has attracted a lot of attention from the computer vision research community. The goal of NAS is to automate the process of deep neural network design. Among several promising approaches proposed in the past [51, 75, 51, 39, 38, 26, 50], DARTS [39] has shown promising outcomes due to its computational efficiency and differentiable optimization formulation. So, in this paper, we use it to design an efficient deep neural network to solve uncalibrated PS.

DARTS searches for a computational cell from a set of defined search spaces, which is a building block of the architecture. Once the optimal cells are obtained, they are stacked to construct the final architecture for training and inference. To find the optimal cell, we define a search space $O$, that is, a set of possible candidate operations. The method first performs continuous relaxation on the search spaces and then searches for an optimal cell. A cell is a directed acyclic graph (DAG) with $N$ nodes and $E$ edges. Each node is a latent feature map representation, say $x^{(i)}$ for the $i$-th node, and each edge is associated with an operation, say $o^{(i,j)}$, between node $i$ and node $j$ (see Fig.1(a)). In a cell, each intermediate node is computed from its preceding nodes as follows:

$$ x^{(j)} = \sum_{i<j} o^{(i,j)}\big(x^{(i)}\big) \tag{2} $$

The choice of a specific operation is replaced by the continuous relaxation of the search space by taking softmax over all the defined candidate operations as follows:

$$ \tilde{o}^{(i,j)}(x) = \sum_{o \in O} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in O} \exp\big(\alpha_{o'}^{(i,j)}\big)} \, o(x) \tag{3} $$

Here, $\alpha^{(i,j)}$ is a vector of dimension $|O|$ which denotes the operation mixing weights on edge $(i, j)$ (see Fig.1(b)). As a result, the search task for DARTS reduces to learning a set of continuous variables $\alpha^{(i,j)} \; \forall \, (i, j)$. The optimal architecture is determined by replacing each mixed operation $\bar{o}^{(i,j)}$ on edge $(i, j)$ with $o^{(i,j)} = \arg\max_{o \in O} \alpha_o^{(i,j)}$, corresponding to the operation which is the “most probable” among the ones listed in $O$ (see Fig.1(c)-Fig.1(d)). The introduced relaxation allows joint learning of the architecture α and its weights ω within the mixture of operations. So, the goal of architecture search now becomes to search for an optimal architecture α using the validation loss with the weights ω that minimize the training loss for a given α. This leads to the following bi-level optimization problem:

$$ \min_{\alpha} \; \mathcal{L}_{val}\big(\omega^{*}(\alpha), \alpha\big) \quad \text{subject to: } \omega^{*}(\alpha) = \arg\min_{\omega} \mathcal{L}_{train}(\omega, \alpha) \tag{4} $$

where $\mathcal{L}_{val}$ and $\mathcal{L}_{train}$ are the validation and training losses, respectively. This optimization problem is solved iteratively until convergence is reached. The architecture α is updated by substituting the lower-level optimization gradient approximation. Concretely, update α by descending $\nabla_{\alpha} \mathcal{L}_{val}(\omega - \xi \nabla_{\omega} \mathcal{L}_{train}(\omega, \alpha), \alpha)$. Subsequently, update ω by descending $\nabla_{\omega} \mathcal{L}_{train}(\omega, \alpha)$, where:

$$ \nabla_{\alpha} \mathcal{L}_{val}\big(\omega^{*}(\alpha), \alpha\big) \approx \nabla_{\alpha} \mathcal{L}_{val}\big(\omega - \xi \nabla_{\omega} \mathcal{L}_{train}(\omega, \alpha), \alpha\big) \tag{5} $$

and $\xi > 0$ is the learning rate of the inner optimization. The idea is that $\omega^{*}(\alpha)$ is approximated with a single learning step, which allows the search process to avoid solving the inner optimization in Eq.(4) exactly. We refer to this formulation as the second-order approximation [39]. To speed up the search process, common practice is to apply the first-order approximation by setting $\xi = 0$. For more details on the bi-level optimization, refer to the work of Liu et al. [39].

2.1.1 Our Cell Description

For our problem, we search for both light calibration and normal estimation networks. Our cells consist of two input nodes, four intermediate nodes, and one output node for both of the networks. Each cell at layer $k$ uses the output of the two preceding cells ($C_{k-1}$ and $C_{k-2}$) at its input nodes and outputs $C_k$ by channel-wise concatenation of the features at the intermediate nodes. To adjust the spatial dimensions, we define two cells, i.e., normal cells and reduction cells. Normal cells preserve the spatial dimensions of the

input feature maps by applying convolution operations with stride 1. The reduction cells use operations with stride 2 adjacent to input nodes, reducing the spatial dimension by half. Although the cell definition for both networks is the same, the network-level search spaces are different due to the problem’s constraints. Next, we describe our procedure to obtain an optimal network architecture for uncalibrated PS.

2.2. Light Calibration Network

The light calibration network predicts all the light sources’ directions and intensities from a set of PS images. Here, we assume the object mask is known. One obvious way to estimate light is to regress the source direction vectors and intensities from a set of images in a continuous space. However, converting this task into a classification problem is more favorable for our purpose. It stems from the fact that learning to classify light source directions into predefined bins of angles is much easier than regressing the unit vector itself. Further, using discretized light directions makes the network robust to small input variations.

We represent the light source direction in the upper hemisphere by its azimuth $\phi \in [0, \pi]$ and elevation $\theta \in [-\pi/2, \pi/2]$ angles. We divide the angle spaces into 36 evenly spaced bins ($K_d = 36$). Our network performs classification on azimuth and elevation separately. For the light intensities, we assign the values in the range of [0.2, 2] divided uniformly into 20 bins ($K_e = 20$) [6].

NAS for Light Calibration Network. To perform NAS for the light calibration network, we use the backbone shown in Fig.2(a). The backbone consists of three main parts: (i) local feature extractor, (ii) aggregation layer, and (iii) classifier. The feature extraction layers provide image-specific information for each input image. The weights of these feature extraction layers are shared among all input images. The image-specific features are then aggregated to a global feature representation with the max-pooling operation. Later, the global feature representation is combined with the image-specific information and fed to the subsequent layers for classification. The fully connected layers provide softmax probabilities for azimuth, elevation, and intensity values.

We use the NAS algorithm to perform search only over the feature extraction layers and classifier layers (shown with dashed box in Fig.2(a)), while keeping other layers fixed. For NAS to provide an optimal architecture over the searchable blocks in the light calibration network backbone, we define our search space as follows:

1. Search Space. Our candidate operations set in the search space for the light calibration network is composed of $O_{light}$ = {“1 × 1 separable conv.”, “3 × 3 separable conv.”, “5 × 5 separable conv.”, “skip connection”, “zero”}. The “zero” operation indicates the lack of connection between two nodes. Each convolutional layer defined in the set first applies ReLU [71] and then convolution with the given kernel size, followed by batch-normalization [24]. As before, our cells consist of two input nodes, four intermediate nodes, and one output node (§2.1.1). Just for the initial cell, we use stem layers as its input for better search. These layers apply fixed convolutions to enrich the initial cell input features.

2. Continuous Relaxation and Optimization. We perform the continuous relaxation of our defined search space using Eq.(3) for differentiable optimization. During the search phase, we perform alternating optimization over weights ω and architecture encoding values α as follows:
• Update network weights ω by $\nabla_{\omega} \mathcal{L}_{train}(\omega, \alpha)$.
• Update architecture mixing weights α by $\nabla_{\alpha} \mathcal{L}_{val}(\omega - \xi \nabla_{\omega} \mathcal{L}_{train}(\omega, \alpha), \alpha)$ (see Eq.(5)).
$\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ denote the loss computed over the training and validation datasets, respectively. We use multi-class cross-entropy loss on azimuth, elevation, and intensity classes to optimize our network [6]. The total light calibration loss is

$$ \mathcal{L}_{light} = \mathcal{L}_{\phi} + \mathcal{L}_{\theta} + \mathcal{L}_{e} \tag{6} $$

where $\mathcal{L}_{\phi}$, $\mathcal{L}_{\theta}$, and $\mathcal{L}_{e}$ are the losses for azimuth, elevation, and intensity, respectively. We utilize the synthetic Blobby and Sculpture datasets [8] for this optimization, where ground-truth labels for lighting are provided.

Once the search phase is complete, we convert the continuous architecture encoding values into a discrete architecture. For that, we select the strongest operation on each edge $(i, j)$ with $o^{(i,j)} = \arg\max_{o \in O_{light}} \alpha_o^{(i,j)}$. We preserve only the strongest two operations preceding each intermediate node. We then train the designed architecture with the optimal operations from scratch on the training dataset to optimize the weights before testing (§3.1).

2.3. Normal Estimation Network

We independently search for an optimal normal estimation network using the backbone shown in Fig.2(b). To feed the light source information into the network, we first convert the $n$ light direction vectors into a tensor $X \in \mathbb{R}^{n \times 3 \times h \times w}$, where each 3-vector is repeated over spatial dimensions $h$ and $w$. This tensor is then concatenated with the input image to form a tensor $I \in \mathbb{R}^{n \times 6 \times h \times w}$. Similar to the light calibration network, we use a shared-weight feature extraction block to process each input. After the image-specific information is extracted, we combine it in a fixed aggregation layer with the max-pooling operation and obtain a global representation. Keeping the aggregation layer fixed allows the network to operate on an arbitrary number of test images and improves robustness. The global information is finally used to regress the normal map, where a fixed normalization layer is used to satisfy the unit-length constraint.

NAS for Normal Estimation Network. Similar to the light calibration network, the cells here consist of two input nodes, four intermediate nodes, and one output node. To

[Figure 2 diagram. (a) Light Calibration Network: input images and mask, searched feature block (normal and reduction cells), aggregation layer, searched classifier block (normal and reduction cells), fully connected layers, output light direction and intensity. (b) Normal Estimation Network: input images, mask and light source, searched feature block (normal and reduction cells), aggregation layer, searched regression block (normal cells), normalize layer, output normal map.]
Figure 2: Our pipeline consists of two networks: (a) Light Calibration Network predicts light source directions and intensities from
images. Our search is confined to feature extraction module and classification module. (b) Normal Estimation Network outputs the surface
normal map from images and estimated light sources. Our search is confined to feature extraction module and regression module.
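
The continuous relaxation of Eq.(3) and the discretization step used for both searched networks can be illustrated with a minimal Python sketch. This is not the authors' implementation: the operation names, the function names (`softmax`, `mixed_output`, `discretize`) and the toy α values are hypothetical, candidate operations are reduced to scalar outputs, and excluding "zero" when picking the final operation is an assumption borrowed from the usual DARTS convention rather than something the text states.

```python
import math

# Candidate operations, mirroring the five entries listed for O_light and O_normal.
OPS = ["sep_conv_1x1", "sep_conv_3x3", "sep_conv_5x5", "skip_connect", "zero"]


def softmax(alphas):
    """Mixing weights of Eq.(3): exp(alpha_o) / sum over o' of exp(alpha_o')."""
    peak = max(alphas)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(alphas, op_outputs):
    """Softmax-weighted sum of the candidate outputs on one edge (scalars here)."""
    return sum(w * y for w, y in zip(softmax(alphas), op_outputs))


def discretize(cell_alphas, keep=2):
    """Arg-max operation per edge, then the `keep` strongest incoming edges per node.

    cell_alphas maps an edge (i, j) to its list of architecture values.
    Assumption (DARTS convention): "zero" is never selected as the final operation.
    """
    incoming = {}
    for (i, j), alphas in cell_alphas.items():
        weights = softmax(alphas)
        candidates = [(w, op) for w, op in zip(weights, OPS) if op != "zero"]
        w, op = max(candidates)
        incoming.setdefault(j, []).append((w, i, op))
    genotype = {}
    for j, edges in incoming.items():
        strongest = sorted(edges, reverse=True)[:keep]
        genotype[j] = sorted((i, op) for _, i, op in strongest)
    return genotype


# Toy architecture values for a cell with input nodes 0, 1 and intermediate nodes 2, 3.
toy_alphas = {
    (0, 2): [0.1, 2.0, 0.3, 0.0, 0.5],
    (1, 2): [0.0, 0.2, 0.1, 1.5, 0.0],
    (0, 3): [0.0, 0.0, 0.0, 0.1, 3.0],
    (1, 3): [1.0, 0.0, 0.0, 0.0, 0.0],
    (2, 3): [0.0, 0.0, 2.5, 0.0, 0.0],
}
genotype = discretize(toy_alphas)
print(genotype)
```

In this toy cell, node 3 has three incoming edges, so the weakest one (from node 0, where "zero" dominates) is dropped and only two operations feed the node.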

efficiently search for architectures at initial layers, we make use of stem layers prior to each search space [39]. These layers apply fixed convolutions to enrich the input features.

1. Search Space. It is a well-known fact that the kernel size has great importance in vision problems. Recent work on photometric stereo has verified that using a bigger kernel size helps to explore the spatial information, but stacking too many of them leads to over-smoothing and degrades the performance [73]. Therefore, we selectively use different kernel sizes in the candidate operations set $O_{normal}$ = {“1 × 1 separable conv.”, “3 × 3 separable conv.”, “5 × 5 separable conv.”, “skip connection”, “zero”}. Here also, each convolutional layer defined in the set first applies ReLU [71] and then convolution with the given kernel size, followed by batch-normalization [24]. The selection of candidate operation sets is further investigated in §3.3 of the supplementary material.

2. Continuous Relaxation and Optimization. Similar to the light calibration network, we use Eq.(3) to make the search space continuous. We then jointly search for the architecture encoding values and the weights using the ground-truth surface normals and light source information during optimization. The optimization is performed using the same bi-level optimization approximation strategy (see Eq.(4) and Eq.(5)). We normalize the images before feeding them to the network. The normalization ensures the network is robust to different intensity levels. To search the normal estimation network, we use the following cosine similarity loss:

$$ \mathcal{L}_{normal} = \frac{1}{m} \sum_{i}^{m} \big(1 - \tilde{n}_i^T n_i\big) \tag{7} $$

where $\tilde{n}_i$ is the normal estimated by our network and $n_i$ is the ground-truth normal at pixel $i$. Note that $\tilde{n}_i$ is a unit vector due to the fixed normalization layer.

After the search optimization for the normal estimation network is done, we obtain the optimal discrete architecture by keeping the operation $o^{(i,j)} = \arg\max_{o \in O_{normal}} \alpha_o^{(i,j)}$ on each edge $(i, j)$. Similar to [39], we only preserve the two preceding operations with the highest weight for each node. Finally, we train our normal estimation network from scratch using the searched architecture. Our normal estimation network uses the light directions and intensities estimated by the light calibration network to predict normals at test time.

3. Experiments and Results

This section first describes our procedure for preparing the dataset for the search, training, and testing phases. Later, we provide the implementation of our method, followed by statistical evaluations and ablation.

3.1. Dataset Preparation

We used three popular photometric stereo datasets for our experiments, statistical analysis, and comparisons, namely, Blobby [25], Sculpture [64], and DiLiGenT [57].

Search and Train Set Details. For architecture search and optimal architecture training, we used 10 objects from the Blobby dataset [25] and 8 from the Sculpture dataset [64]. We considered the rendered photometric stereo images of these datasets provided by Chen et al. [8], which use 64 random lights to render the objects. In the search and train phases, we randomly choose 32 light source images. Following Chen et al. [8], we considered 128 × 128 sized images for both the Blobby and Sculpture datasets.

(a) Preparation of Search Set. Searching for an optimal architecture using one-shot NAS [39] can be computationally expensive. To address that, we use only 10% of the dataset such that it contains subjects from all the categories present in the Blobby and Sculpture datasets. Next, we resized all those 128 × 128 resolution images to 64 × 64. We refer to this dataset as the Blobby search set and Sculpture search set. Our search set is further divided into a search train set and a search validation set. The train set is prepared by taking eight shapes from the Blobby search set and six shapes from the Sculpture search set. The search validation set is composed of two shapes from the Blobby and Sculpture search sets, respectively. Hence, approximately 80% of the search set is used as search train set and 20% is used as search validation

[Figure 3(a): Training Curve of Light Calibration Net. Plot of training MAE; x-axis: Epochs, y-axis: Mean Angular Error in Degrees.]
[Figure 3(b): Light Directions and Intensities obtained using Light Calibration Network. Panels show the ground truth and the estimates for each DiLiGenT object; color scale from 0 to 1.]

Object | MAElight | Eerr
BALL | 3.27° | 0.07
CAT | 8.57° | 0.10
POT1 | 3.22° | 0.08
BEAR | 4.74° | 0.08
POT2 | 4.29° | 0.08
BUDDHA | 5.46° | 0.07
GOBLET | 8.34° | 0.07
READING | 6.15° | 0.21
COW | 3.74° | 0.10
HARVEST | 8.77° | 0.10

Figure 3: (a) Training curve of the light calibration network. (b) Light calibration network results on DiLiGenT objects. We show the
light direction by projecting the vector [x, y, z] to a corresponding point [x, y]. The color of the point shows the light intensity value in
[0, 1] range. MAElight is the mean angular error in the estimation of light source direction and Eerr stands for the intensity error.
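
Because the light calibration network classifies lighting into bins rather than regressing it, each ground-truth light has to be turned into class labels, and each prediction has to be mapped back to a value. The sketch below illustrates this with the ranges and bin counts given in §2.2 (36 angle bins, 20 intensity bins). The helper names are hypothetical, and decoding a class to its bin center is an assumption made for illustration; the text does not say how the authors map classes back to continuous values.

```python
import math

K_D = 36  # azimuth / elevation bins, as stated in the text
K_E = 20  # intensity bins, as stated in the text

# (low, high, number of bins) for azimuth, elevation and intensity.
RANGES = [
    (0.0, math.pi, K_D),
    (-math.pi / 2, math.pi / 2, K_D),
    (0.2, 2.0, K_E),
]


def to_bin(value, low, high, num_bins):
    """Index of the evenly spaced bin of [low, high] that contains value."""
    width = (high - low) / num_bins
    index = int((value - low) / width)
    return min(max(index, 0), num_bins - 1)  # clamp so value == high lands in the last bin


def bin_center(index, low, high, num_bins):
    """Representative value of a bin (assumed decoding rule: the bin center)."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


def encode_light(azimuth, elevation, intensity):
    """Class labels for one light source."""
    values = (azimuth, elevation, intensity)
    return tuple(to_bin(v, *r) for v, r in zip(values, RANGES))


def decode_light(labels):
    """Approximate (azimuth, elevation, intensity) recovered from class labels."""
    return tuple(bin_center(k, *r) for k, r in zip(labels, RANGES))


labels = encode_light(1.0, 0.3, 1.0)
recovered = decode_light(labels)
print(labels, recovered)
```

With these settings each angle bin spans 180°/36 = 5°, so under the bin-center assumption the quantization error per angle is at most 2.5°, and at most 0.045 for the intensity.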

set. This is done in a way that there is no common subject between the train and validation sets. We used a batch size of four at train and validation time during the search phase. The search set is the same for the light calibration and normal estimation network’s search.

(b) Preparation of Train Set. Once the optimal architectures for light calibration and normal estimation are obtained, we use the train set for training these networks from scratch. Since we searched the architecture using 64 × 64 size images, we use a convolution layer with stride 2 at train time for the light calibration network’s training. Following Chen et al. [8], we use 99% of the Blobby and Sculpture dataset for training and 1% for validation. For light calibration, we used a batch size of thirty-two at train time and eight for validation. For normal estimation, instead, we considered a batch size of four both at training and validation.

3.1.1 Test Set Details.

We tested our networks on the recently proposed DiLiGenT PS dataset [57]. It consists of 10 real-world objects, with images captured by 96 LED light sources. It provides ground-truth normals and calibrated light directions, making it an ideal dataset for evaluation. Following Chen et al. [8], we use 96 images per object at 128 × 128 resolution to test our light calibration and normal estimation network.

3.2. Implementation Details

The proposed method is implemented with Python 3.6 and PyTorch 1.1 [49]. For both networks, we employ the same optimizer, learning rate, and weight decay settings. The architecture parameters α and the network weights ω are optimized using Adam [28]. During the architecture search phase, the optimizer is initialized with the learning rate $\eta_{alpha} = 3 \times 10^{-4}$, momentum $\beta = (0.5, 0.999)$ and weight decay of $1 \times 10^{-3}$. At model train time, the optimizer is initialized with the learning rate $\eta_{w} = 5 \times 10^{-4}$, momentum $\beta = (0.5, 0.999)$ and weight decay of $3 \times 10^{-4}$. We conducted all the experiments on a computer with a single NVIDIA GPU with 12GB of RAM.

We search for two types of cells, namely normal cell and reduction cell. We use the loss functions defined in Eq.(6) and Eq.(7) during the search phase to recover optimal cells for each network independently. Fig.2(a) and Fig.2(b) show the light calibration and the normal estimation backbones and their searchable parts, respectively. For the light calibration network, we have two searchable blocks: (i) Feature block and (ii) Classification block. Here, we design our feature block using three normal cells and two reduction cells, and the classification block using one normal cell and one reduction cell. Similarly, we have two searchable blocks, (i) Feature block and (ii) Regressor block, for the normal estimation network. Here, the feature block comprises three normal cells and two reduction cells, while the regressor block is composed of three normal cells. To construct the network design for searchable blocks, each normal cell is concatenated sequentially to the reduction cell in order. We use 3 epochs to search the architecture for each network.

At train time, we regularize the normal estimation network loss function using the concept of an auxiliary tower [39] for performance gain. Consequently, we modify its loss function at train time as follows:

$$ \mathcal{L}_{normal} = \frac{1}{m} \sum_{i}^{m} \big(1 - \tilde{n}_i^T n_i\big) + \lambda_{aux} \frac{1}{m} \sum_{i}^{m} \big(1 - \hat{n}_i^T n_i\big) \tag{8} $$

where $\lambda_{aux}$ is a regularization parameter, and $\hat{n}_i$ is the output surface normal at pixel $i$ due to the auxiliary tower. We set $\lambda_{aux} = 0.4$. We observed that the auxiliary tower improves the performance of the normal estimation network. It can be argued that a similar regularizer could be used for the light calibration network. However, in that case, we have to incorporate that regularizer for each image independently, which can be computationally expensive. Fig.3(a) and Fig.4(a) show the training curve for the light calibration and normal estimation network, respectively. We trained the

light calibration and normal estimation networks for six and three epochs, respectively, for inference.

Figure 4: (a) Training curve of the normal estimation network (training MAE, i.e. mean angular error in degrees, against epochs). (b) Qualitative surface normal results on the DiLiGenT benchmark: ground-truth normals, estimated normals, and error maps (color scale 0° to 90°). The bottom row demonstrates the angular error maps and mean angular errors of our results: BALL 3.46°, CAT 8.94°, POT1 7.76°, BEAR 5.48°, POT2 7.10°, BUDDHA 10.00°, GOBLET 9.78°, READING 15.02°, COW 6.04°, HARVEST 17.97°.

Methods | Ball | Cat | Pot1 | Bear | Pot2 | Buddha | Goblet | Reading | Cow | Harvest | Average
Alldrin et al. (2007) [2] | 7.27 | 31.45 | 18.37 | 16.81 | 49.16 | 32.81 | 46.54 | 53.65 | 54.72 | 61.70 | 37.25
Shi et al. (2010) [55] | 8.90 | 19.84 | 16.68 | 11.98 | 50.68 | 15.54 | 48.79 | 26.93 | 22.73 | 73.86 | 29.59
Wu & Tan (2013) [69] | 4.39 | 36.55 | 9.39 | 6.42 | 14.52 | 13.19 | 20.57 | 58.96 | 19.75 | 55.51 | 23.93
Lu et al. (2013) [43] | 22.43 | 25.01 | 32.82 | 15.44 | 20.57 | 25.76 | 29.16 | 48.16 | 22.53 | 34.45 | 27.63
Papadh. et al. (2014) [48] | 4.77 | 9.54 | 9.51 | 9.07 | 15.90 | 14.92 | 29.93 | 24.18 | 19.53 | 29.21 | 16.66
Lu et al. (2017) [42] | 9.30 | 12.60 | 12.40 | 10.90 | 15.70 | 19.00 | 18.30 | 22.30 | 15.00 | 28.00 | 16.30
Ours | 3.46 | 8.94 | 7.76 | 5.48 | 7.10 | 10.00 | 9.78 | 15.02 | 6.04 | 17.97 | 9.15
Table 1: Quantitative comparison (MAE of surface normals, in degrees) with the traditional uncalibrated photometric stereo methods on the DiLiGenT benchmark. Our searched architecture estimates accurate surface normals of objects with general reflectance properties.

3.3. Qualitative and Quantitative Evaluation

Evaluation Metric. To measure the accuracy of the estimated light directions and surface normals, we adopt the standard mean angular error (MAE) metric as follows:

$$\mathrm{MAE}_{\text{light}} = \frac{180}{\pi}\,\frac{1}{n}\sum_{i}^{n}\arccos\left(\tilde{\ell}_i^{T}\ell_i\right) \qquad (9)$$

$$\mathrm{MAE}_{\text{normal}} = \frac{180}{\pi}\,\frac{1}{m}\sum_{i}^{m}\arccos\left(\tilde{n}_i^{T} n_i\right) \qquad (10)$$

where n is the number of images, and m is the number of object pixels. $\tilde{\ell}_i$ and $\ell_i$ denote the estimated and ground-truth light directions. Similarly, $\tilde{n}_i$ and $n_i$ denote the estimated and ground-truth surface normals. As the auxiliary tower is not used at test time, we define the metrics using $\tilde{n}_i$. Following previous works [6, 8], we report MAE in degrees.

Unlike light directions and surface normals, light intensity can only be estimated up to a scale factor. For this reason, instead of using the exact intensity values for evaluation, we use a scale-invariant relative error metric [6]:

$$E_{\text{err}} = \frac{1}{n}\sum_{i}^{n}\left(\frac{|s\tilde{e}_i - e_i|}{e_i}\right) \qquad (11)$$

Here, $\tilde{e}_i$ and $e_i$ are the estimated and ground-truth light intensities, respectively, with s as the scale factor. Following Chen et al. [7], we solve $\arg\min_{s}\sum_{i}^{n}(s\tilde{e}_i - e_i)^2$ using least squares to compute s for intensity evaluation.

3.3.1 Inference

Once optimal architectures are obtained, we train these networks for inference. We test their performance using the defined metrics on the test set. For each test object, we first feed the object images at 128 × 128 resolution to the light calibration network to predict the light directions and intensities. Then, we use the images and estimated light sources as input to the normal estimation network to predict the surface normals. A visual diagram of the optimal cell architectures is provided in the supplementary material.

(a) Performance of Light Calibration Network. To show the validity of our searched light calibration network, we compared its predictions against the DiLiGenT ground-truth light directions and intensities. Fig.3(b) shows the quantitative and qualitative results obtained using our network. Concretely, it provides the light direction error $\mathrm{MAE}_{\text{light}}$ and intensity error ($E_{\text{err}}$) for all object categories. The results indicate that the searched light calibration network can reliably predict light source direction and intensity from images of objects with complex surface profiles and different material properties.

(b) Comparison of Surface Normal Accuracy. We documented the performance comparison of our approach against the traditional uncalibrated photometric stereo methods in Table 1. The statistics show that our method performs significantly better than such uncalibrated approaches
for all the object categories. That is because we don't explicitly rely on BRDF model assumptions and the well-known matrix factorization approach. Instead, our work exploits the benefit of the deep neural network to handle complicated BRDF problems by learning from data. Rather than using matrix factorization, our work independently learns to estimate light from data and use it to solve surface normals.

Object | Robust PS | Holistic PS | SDPS-Net | UPS-FCN† | Ours
(a) BEAR | 9.07° | 6.42° | 6.89° | 7.19° | 5.48°
(b) GOBLET | 29.93° | 20.57° | 11.91° | 18.07° | 9.78°
(c) POT2 | 15.90° | 14.52° | 7.50° | 11.11° | 7.10°
Figure 5: Visual comparison (error maps, color scale 0° to 90°, with the mean angular errors listed above) against Robust PS [48], Holistic PS [69], SDPS-Net [6] and UPS-FCN [8] on (a) BEAR (b) GOBLET and (c) POT2 object from DiLiGenT dataset. The statistics show the superiority of our searched architecture.

Methods | Params (M) | Ball | Cat | Pot1 | Bear | Pot2 | Buddha | Goblet | Reading | Cow | Harvest | Average
UPS-FCN† (2018) [8] | 6.1 | 3.96 | 12.16 | 11.13 | 7.19 | 11.11 | 13.06 | 18.07 | 20.46 | 11.84 | 27.22 | 13.62
SDPS-Net (2019) [6] | 6.6 | 2.77 | 8.06 | 8.14 | 6.89 | 7.50 | 8.97 | 11.91 | 14.90 | 8.48 | 17.43 | 9.51
GCNet (2020) [9] + PS-FCN [8] | 6.8 | 2.50 | 7.90 | 7.20 | 5.60 | 7.10 | 8.60 | 9.60 | 14.90 | 7.80 | 16.20 | 8.70
Kaya et al. (2021) [27] | 8.1 | 3.78 | 7.91 | 8.75 | 5.96 | 10.17 | 13.14 | 11.94 | 18.22 | 10.85 | 25.49 | 11.62
Ours (w/o auxiliary) | 4.4 | 4.86 | 9.79 | 9.98 | 4.97 | 8.95 | 10.29 | 9.46 | 15.59 | 8.06 | 18.20 | 9.98
Ours | 4.4 | 3.46 | 8.94 | 7.76 | 5.48 | 7.10 | 10.00 | 9.78 | 15.02 | 6.04 | 17.97 | 9.15
Table 2: Quantitative comparison (MAE of surface normals, in degrees) of deep uncalibrated photometric stereo methods on DiLiGenT benchmark [59]. Our searched architecture on average provides results that are better compared to other deep networks not only in surface orientation accuracy (MAE) but also in model size. The blue show the statistics where our method has the second best performance. We used deeper version of UPS-FCN [8].
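The error values reported in Tables 1 and 2 follow the metrics of Eqs. (9) to (11), and the "w/o auxiliary" row of Table 2 contrasts training with and without the auxiliary term of Eq. (8). The following is a minimal sketch of these quantities in plain Python, assuming unit-length vectors stored as tuples. The function names are illustrative assumptions and are not taken from the authors' PyTorch implementation; the closed-form scale s = Σ ẽ_i e_i / Σ ẽ_i² is simply the least-squares minimizer of Σ (s ẽ_i − e_i)².

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def mean_angular_error_deg(estimated, ground_truth):
    """Eqs. (9)/(10): mean angle, in degrees, between paired unit vectors."""
    total = 0.0
    for est, gt in zip(estimated, ground_truth):
        cosine = max(-1.0, min(1.0, dot(est, gt)))  # clamp against rounding error
        total += math.degrees(math.acos(cosine))
    return total / len(ground_truth)


def least_squares_scale(estimated, ground_truth):
    """Closed-form s minimising sum_i (s * e_est_i - e_gt_i) ** 2."""
    return dot(estimated, ground_truth) / dot(estimated, estimated)


def relative_intensity_error(estimated, ground_truth):
    """Eq. (11): scale-invariant relative error of the light intensities."""
    s = least_squares_scale(estimated, ground_truth)
    errors = [abs(s * e - g) / g for e, g in zip(estimated, ground_truth)]
    return sum(errors) / len(errors)


def normal_loss_with_auxiliary(main, auxiliary, ground_truth, lambda_aux=0.4):
    """Eq. (8): cosine loss of the main output plus the weighted auxiliary term."""
    m = len(ground_truth)
    main_term = sum(1.0 - dot(n, g) for n, g in zip(main, ground_truth)) / m
    aux_term = sum(1.0 - dot(n, g) for n, g in zip(auxiliary, ground_truth)) / m
    return main_term + lambda_aux * aux_term


if __name__ == "__main__":
    # Toy data (hypothetical): the second estimated normal is 90 degrees off.
    gt_normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
    est_normals = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
    print(mean_angular_error_deg(est_normals, gt_normals))
    # Estimated intensities that differ from the ground truth only by a scale.
    print(relative_intensity_error([1.0, 2.0, 4.0], [2.0, 4.0, 8.0]))
    print(normal_loss_with_auxiliary(gt_normals, est_normals, gt_normals))
```

Because s is refit for every object, an estimate that differs from the ground-truth intensities only by a global scale yields zero error, which is the scale invariance Eq. (11) is meant to provide.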
  

Further, we compared our method with the state-of-the-art deep uncalibrated PS methods. Table 2 shows that our method achieves competitive results with an average MAEnormal of 9.15°, having the second best performance overall. The best performing method [9] uses a four-stage cascade structure, making it complex and deep. On the contrary, our searched architecture is light and it can achieve such accuracy with 2.4M fewer parameters (4.4M against 6.8M in Table 2). Fig.5 provides additional visual comparison of our results with several other approaches from the literature [48, 69, 6, 8]. Table 2 also shows the benefit of using an auxiliary tower at train time (see supplementary for more details and results).

(c) Ablation Study. (i) Analysing the performance with the change in number of input images at test time. Our light calibration and normal estimation networks can work with an arbitrary number of input images at test time. In this experiment, we analyse how the number of images affects the accuracy of the estimated lighting and surface normals. Fig. 6(a) and 6(b) show the variation in the mean angular error with different numbers of images. As expected, the error decreases as we increase the number of images. Feeding more images allows the networks to extract more information, and therefore, the best results are obtained by using all 96 images provided by the DiLiGenT dataset [59]. For more experimental results, ablations and visualizations, refer to the supplementary material.

Figure 6: Variation in MAE w.r.t the change in the number of input images at test time. Observation with (a) light calibration and (b) normal estimation network, respectively. Both plots show the mean angular error in degrees against the number of input images for Ball, Buddha, Cow, and the average.

4. Conclusion

In this paper, we demonstrated the effectiveness of applying differentiable NAS to deep uncalibrated PS. Though applying the existing differentiable NAS framework directly to our problem is not straightforward, we showed that we could successfully utilize NAS provided PS-specific constraints are well satisfied during the search, train, and test time. We search for an optimal light calibration network and normal estimation network using the one-shot NAS method by leveraging hand-crafted deep neural network design knowledge and fixing some of the layers or operations to account for the PS-specific constraints. The architecture we discover is lightweight, and it provides comparable or better accuracy than the existing deep uncalibrated PS methods.

Acknowledgement. This work was funded by Focused Research Award from Google (CVL, ETH 2019-HE-318, 2019-HE-323).

References

[1] AutoML. https://www.automl.org/automl. Accessed: 02-06-2021.
[2] Neil Alldrin, Todd Zickler, and David Kriegman. Photometric stereo with non-parametric and spatially-varying reflectance. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[3] Neil G Alldrin, Satya P Mallick, and David J Kriegman. Resolving the generalized bas-relief ambiguity by entropy minimization. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7. IEEE, 2007.
[4] Peter N Belhumeur, David J Kriegman, and Alan L Yuille. The bas-relief ambiguity. International Journal of Computer Vision, 35(1):33–44, 1999.
[5] Manmohan Krishna Chandraker, Fredrik Kahl, and David J Kriegman. Reflections on the generalized bas-relief ambiguity. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 788–795. IEEE, 2005.
[6] Guanying Chen, Kai Han, Boxin Shi, Yasuyuki Matsushita, and Kwan-Yee K Wong. Self-calibrating deep photometric stereo networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8739–8747, 2019.
[7] Guanying Chen, Kai Han, Boxin Shi, Yasuyuki Matsushita, and Kwan-Yee Kenneth Wong. Deep photometric stereo for non-Lambertian surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[8] Guanying Chen, Kai Han, and Kwan-Yee K Wong. PS-FCN: A flexible learning framework for photometric stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–18, 2018.
[9] Guanying Chen, Michael Waechter, Boxin Shi, Kwan-Yee K Wong, and Yasuyuki Matsushita. What is learned in deep uncalibrated photometric stereo? In European Conference on Computer Vision, 2020.
[10] Xiangxiang Chu, Tianbao Zhou, Bo Zhang, and Jixiang Li. Fair DARTS: Eliminating unfair advantages in differentiable architecture search, 2020.
[11] Hin-Shun Chung and Jiaya Jia. Efficient photometric stereo on glossy surfaces with wide specular lobes. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[12] Ondrej Drbohlav and M Chantler. Can two specular pixels calibrate photometric stereo? In Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, volume 2, pages 1850–1857. IEEE, 2005.
[13] Kui Fu, Jiansheng Peng, Qiwen He, and Hanxiao Zhang. Single image 3D object reconstruction based on deep learning: A review. Multimedia Tools and Applications, 80(1):463–498, 2021.
[14] Y Fu, W Chen, H Wang, H Li, Y Lin, and Z Wang. AutoGAN-Distiller: Searching to compress generative adversarial networks. ICML, 2020.
[15] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2009.
[16] Athinodoros S Georghiades. Incorporating the Torrance and Sparrow model of reflectance in uncalibrated photometric stereo. In ICCV, pages 816–823. IEEE, 2003.
[17] Dan B Goldman, Brian Curless, Aaron Hertzmann, and Steven M Seitz. Shape and spatially-varying BRDFs from photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(6):1060–1071, 2009.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks, 2016.
[19] Santo Hiroaki, Michael Waechter, and Yasuyuki Matsushita. Deep near-light photometric stereo for spatially varying reflectances. In European Conference on Computer Vision, 2020.
[20] Satoshi Ikehata. CNN-PS: CNN-based photometric stereo for general non-convex surfaces. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–18, 2018.
[21] Satoshi Ikehata and Kiyoharu Aizawa. Photometric stereo using constrained bivariate regression for general isotropic surfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2179–2186, 2014.
[22] Satoshi Ikehata and Kiyoharu Aizawa. Photometric stereo using constrained bivariate regression for general isotropic surfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[23] Satoshi Ikehata, David Wipf, Yasuyuki Matsushita, and Kiyoharu Aizawa. Photometric stereo using sparse Bayesian regression for general diffuse surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(9):1816–1831, 2014.
[24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[25] M. K. Johnson and E. H. Adelson. Shape estimation in natural illumination. CVPR '11, pages 2553–2560, USA, 2011. IEEE Computer Society.
[26] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with Bayesian optimisation and optimal transport, 2019.
[27] Berk Kaya, Suryansh Kumar, Carlos Oliveira, Vittorio Ferrari, and Luc Van Gool. Uncalibrated neural inverse rendering for photometric stereo of general surfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2021.
[28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[29] Suryansh Kumar. Jumping manifolds: Geometry aware dense non-rigid structure from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5346–5355, 2019.
[30] Suryansh Kumar. Non-rigid structure from motion: Prior-free factorization method revisited. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 51–60, 2020.
[31] Suryansh Kumar, Anoop Cherian, Yuchao Dai, and Hongdong Li. Scalable dense non-rigid structure-from-motion: A
Grassmannian perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 254–263, 2018.
[32] Suryansh Kumar, Yuchao Dai, and Hongdong Li. Monocular dense 3D reconstruction of a complex dynamic scene from two perspective frames. In Proceedings of the IEEE International Conference on Computer Vision, pages 4649–4657, 2017.
[33] Suryansh Kumar, Yuchao Dai, and Hongdong Li. Superpixel soup: Monocular dense 3D reconstruction of a complex dynamic scene. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[34] Suryansh Kumar, Ram Srivatsav Ghorakavi, Yuchao Dai, and Hongdong Li. Dense depth estimation of a complex dynamic scene without explicit 3D motion estimation. arXiv preprint arXiv:1902.03791, 2019.
[35] Kiriakos N Kutulakos and Eron Steger. A theory of refractive and specular 3D shape by light-path triangulation. International Journal of Computer Vision, 76(1):13–29, 2008.
[36] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation, 2019.
[37] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 82–92, 2019.
[38] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search, 2017.
[39] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[40] Fotios Logothetis, Ignas Budvytis, Roberto Mecca, and Roberto Cipolla. A CNN based approach for the near-field photometric stereo problem. arXiv preprint arXiv:2009.05792, 2020.
[41] Fotios Logothetis, Ignas Budvytis, Roberto Mecca, and Roberto Cipolla. PX-NET: Simple, efficient pixel-wise training of photometric stereo networks. arXiv preprint arXiv:2008.04933, 2020.
[42] Feng Lu, Xiaowu Chen, Imari Sato, and Yoichi Sato. SymPS: BRDF symmetry guided photometric stereo for shape and light source estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(1):221–234, 2017.
[43] Feng Lu, Imari Sato, and Yoichi Sato. Uncalibrated photometric stereo based on elevation angle recovery from BRDF symmetry of isotropic materials. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 168–176, 2015.
[44] Davide Menini, Suryansh Kumar, Martin R Oswald, Erik Sandstrom, Cristian Sminchisescu, and Luc Van Gool. A real-time online learning framework for joint 3D reconstruction and semantic segmentation of indoor scenes. arXiv preprint arXiv:2108.05246, 2021.
[45] Shree K Nayar and Mohit Gupta. Diffuse structured light. In 2012 IEEE International Conference on Computational Photography (ICCP), pages 1–11. IEEE, 2012.
[46] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136. IEEE, 2011.
[47] Tae-Hyun Oh, Hyeongwoo Kim, Yu-Wing Tai, Jean-Charles Bazin, and In So Kweon. Partial sum minimization of singular values in RPCA for low-level vision. In Proceedings of the IEEE International Conference on Computer Vision, pages 145–152, 2013.
[48] Thoma Papadhimitri and Paolo Favaro. A closed-form, consistent and robust solution to uncalibrated photometric stereo via local diffuse reflectance maxima. International Journal of Computer Vision, 107(2):139–154, 2014.
[49] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[50] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing, 2018.
[51] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search, 2019.
[52] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.
[53] Ufuk Sakarya, Uğur Murat Leloğlu, and Erol Tunalı. Three-dimensional surface reconstruction for cartridge cases using photometric stereo. Forensic Science International, 175(2-3):209–217, 2008.
[54] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2008.
[55] Boxin Shi, Yasuyuki Matsushita, Yichen Wei, and Chao Xu. Self-calibrating photometric stereo. pages 1118–1125, 06 2010.
[56] Boxin Shi, Yasuyuki Matsushita, Yichen Wei, Chao Xu, and Ping Tan. Self-calibrating photometric stereo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1118–1125. IEEE, 2010.
[57] B. Shi, Z. Mo, Z. Wu, D. Duan, S. Yeung, and P. Tan. A benchmark dataset and evaluation for non-Lambertian and uncalibrated photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):271–284, 2019.
[58] Boxin Shi, Ping Tan, Yasuyuki Matsushita, and Katsushi Ikeuchi. Bi-polynomial modeling of low-frequency reflectances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1078–1091, 2013.
[59] Boxin Shi, Zhe Wu, Zhipeng Mo, Dinglong Duan, Sai-Kit Yeung, and Ping Tan. A benchmark dataset and evaluation
for non-Lambertian and uncalibrated photometric stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3707–3716, 2016.
[60] Rhea Sanjay Sukthanker, Zhiwu Huang, Suryansh Kumar, Erik Goron Endsjo, Yan Wu, and Luc Van Gool. Neural architecture search of SPD manifold networks, 2020.
[61] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790, 2020.
[62] Ping Tan, Satya P Mallick, Long Quan, David J Kriegman, and Todd Zickler. Isotropy, reciprocity and the generalized bas-relief ambiguity. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
[63] Tatsunori Taniai and Takanori Maehara. Neural inverse rendering for general reflectance photometric stereo. In International Conference on Machine Learning (ICML), pages 4857–4866, 2018.
[64] Olivia Wiles and Andrew Zisserman. SilNet: Single- and multi-view reconstruction by learning from silhouettes. arXiv preprint arXiv:1711.07888, 2017.
[65] Robert J Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1):191139, 1980.
[66] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.
[67] Lun Wu, Arvind Ganesh, Boxin Shi, Yasuyuki Matsushita, Yongtian Wang, and Yi Ma. Robust photometric stereo via low-rank matrix completion and recovery. In Asian Conference on Computer Vision, pages 703–717. Springer, 2010.
[68] Yan Wu, Zhiwu Huang, Suryansh Kumar, Rhea Sanjay Sukthanker, Radu Timofte, and Luc Van Gool. Trilevel neural architecture search for efficient single image super-resolution. arXiv preprint arXiv:2101.06658, 2021.
[69] Zhe Wu and Ping Tan. Calibrating photometric stereo by holistic reflectance symmetry analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1498–1505, 2013.
[70] Wuyuan Xie, Chengkai Dai, and Charlie CL Wang. Photometric stereo with near point lighting: A solution by mesh deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4585–4593, 2015.
[71] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
[72] Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Auto-FPN: Automatic network architecture adaptation for object detection beyond classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6649–6658, 2019.
[73] Zhuokun Yao, Kun Li, Ying Fu, Haofeng Hu, and Boxin Shi. GPS-Net: Graph-based photometric stereo network. Advances in Neural Information Processing Systems, 33, 2020.
[74] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
[75] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8697–8710, 2018.
