Learning Accurate Dense Correspondences and When to Trust Them


 Prune Truong Martin Danelljan Luc Van Gool Radu Timofte
 Computer Vision Lab, ETH Zurich, Switzerland
 {prune.truong, martin.danelljan, vangool, radu.timofte}@vision.ee.ethz.ch

 Abstract

Establishing dense correspondences between a pair of images is an important and general problem. However, dense flow estimation is often inaccurate in the case of large displacements or homogeneous regions. For most applications and down-stream tasks, such as pose estimation, image manipulation, or 3D reconstruction, it is crucial to know when and where to trust the estimated correspondences. In this work, we aim to estimate a dense flow field relating two images, coupled with a robust pixel-wise confidence map indicating the reliability and accuracy of the prediction. We develop a flexible probabilistic approach that jointly learns the flow prediction and its uncertainty. In particular, we parametrize the predictive distribution as a constrained mixture model, ensuring better modelling of both accurate flow predictions and outliers. Moreover, we develop an architecture and training strategy tailored for robust and generalizable uncertainty prediction in the context of self-supervised training. Our approach obtains state-of-the-art results on multiple challenging geometric matching and optical flow datasets. We further validate the usefulness of our probabilistic confidence estimation for the task of pose estimation. Code and models will be released at https://github.com/PruneTruong/PDCNet.

Figure 1. Estimating dense correspondences between the query (a) and the reference (b) image. The query is warped according to the resulting flows (c)-(d). The baseline (c) does not estimate an uncertainty map and is therefore unable to filter the inaccurate flows at e.g. occluded and homogeneous regions. In contrast, our PDC-Net (d) not only estimates accurate correspondences, but also when to trust them. It predicts a robust uncertainty map that identifies accurate matches and excludes incorrect and unmatched pixels (red). Panels: (a) Query image, (b) Reference image, (c) Baseline, (d) PDC-Net (Ours).

1. Introduction

Finding pixel-wise correspondences between pairs of images is a fundamental computer vision problem with numerous important applications, including dense 3D reconstruction [51], video analysis [58, 43], image registration [63, 56], image manipulation [14, 36], and texture or style transfer [26, 34]. Dense correspondence estimation has most commonly been addressed in the context of optical flow [16, 2, 61, 21], where the image pairs represent consecutive frames in a video. While these methods excel in the case of small appearance changes and limited displacements, they cannot cope with the challenges posed by the more general geometric matching task. In geometric matching, the images can stem from radically different views of the same scene, often captured by different cameras and at different occasions. This leads to large displacements and significant appearance transformations between the frames.

In contrast to optical flow, the more general dense correspondence problem has received much less attention [39, 65, 48, 54]. Dense flow estimation is prone to errors in the presence of large displacements, appearance changes, or homogeneous regions. It is also ill-defined in case of occlusions or in e.g. the sky, where predictions are bound to be inaccurate (Fig. 1c). For geometric matching applications, it is thus crucial to know when and where to trust the estimated correspondences. For instance, pose estimation, 3D reconstruction, and image-based localization require a set of highly robust and accurate matches as input. The predicted dense flow field must therefore be paired with a robust confidence estimate (Fig. 1d). Uncertainty estimation is also indispensable for safety-critical tasks, such as autonomous driving and medical imaging. In this work, we set out to expand the application domain of dense correspondence estimation by learning to predict reliable confidence values.
We propose the Probabilistic Dense Correspondence Network (PDC-Net), for joint learning of dense flow and uncertainty estimation, applicable even for extreme appearance and view-point changes. Our model predicts the conditional probability density of the flow, parametrized as a constrained mixture model. However, learning reliable and generalizable uncertainties without densely annotated real-world training data is a highly challenging problem. Standard self-supervised techniques [65, 39, 46] do not faithfully model real motion patterns, appearance changes, and occlusions. We tackle this challenge by introducing a carefully designed architecture and improved self-supervision to ensure robust and generalizable uncertainty predictions.

Figure 2. 3D reconstruction of Aachen [49] using the dense correspondences and uncertainties predicted by PDC-Net.

Contributions: Our main contributions are as follows. (i) We introduce a constrained mixture model of the predictive distribution, allowing the network to flexibly model both accurate predictions and outliers with large errors. (ii) We propose an architecture for predicting the parameters of our predictive distribution, that carefully exploits the information encoded in the correlation volume, to achieve generalizable uncertainties. (iii) We improve upon self-supervised data generation pipelines to ensure more robust uncertainty estimation. (iv) We utilize our uncertainty measure to address extreme view-point changes by iteratively refining the prediction. (v) We perform extensive experiments on a variety of datasets and tasks. In particular, our approach sets a new state-of-the-art on the MegaDepth geometric matching dataset [33], on the KITTI-2015 training set [13], and outperforms previous dense methods for pose estimation on the YFCC100M dataset [62]. Moreover, without further post-processing, our confident dense matches can be directly input to 3D reconstruction pipelines, as shown in Fig. 2.

2. Related work

Confidence estimation in geometric matching: Only very few works have explored confidence estimation in the context of dense geometric or semantic matching. Novotny et al. [41] estimate the reliability of their trained descriptors by using a self-supervised probabilistic matching loss for the task of semantic matching. A few approaches [48, 18, 47] represent the final correspondences as a 4D correspondence volume, thus inherently encoding a confidence score for each tentative match. However, these approaches are usually restricted to low-resolution images, thus hindering accuracy. Moreover, generating one final confidence value for each match is highly non-trivial since multiple high-scoring alternatives often co-occur. Similarly, Wiles et al. [69] learn dense descriptors conditioned on an image pair, along with their distinctiveness score. However, the latter is trained with hand-crafted heuristics, while we instead do not make assumptions on what the confidence score should be, and learn it directly from the data with a single unified loss. In DGC-Net, Melekhov et al. [39] predict both dense correspondence and matchability maps relating image pairs. However, their matchability map is only trained to identify out-of-view pixels rather than to reflect the actual reliability of the matches. Recently, Shen et al. [54] proposed RANSAC-Flow, a two-stage image alignment method, which also outputs a matchability map. It performs coarse alignment with multiple homographies using RANSAC on off-the-shelf deep features, followed by fine alignment. In contrast, we propose a unified network that estimates probabilistic uncertainties.

Uncertainty estimation in optical flow: While optical flow has been a long-standing subject of active research, only a few methods provide uncertainty estimates. A few approaches [3, 28, 29, 31, 1] treat the uncertainty estimation as a post-processing step. Recently, some works propose probabilistic frameworks for joint optical flow and uncertainty prediction instead. They either estimate the model uncertainty [23, 11], termed epistemic uncertainty [25], or focus on the uncertainty from the observation noise, referred to as aleatoric uncertainty [25]. Following recent works [12, 71], we focus on aleatoric uncertainty and how to train a generalizable uncertainty estimate in the context of self-supervised training. Wannenwetsch et al. [68] derive ProbFlow, a probabilistic approach applicable to energy-based optical flow algorithms [59, 3, 45]. Gast et al. [12] propose probabilistic output layers that require only minimal changes to existing networks. Yin et al. [71] introduce HD³F, a method to estimate uncertainty locally at multiple spatial scales and aggregate the results. While these approaches are carefully designed for optical flow data and restricted to small displacements, we consider the more general setting of estimating reliable confidence values for dense geometric matching, applicable to e.g. pose estimation and 3D reconstruction. This adds additional challenges, including coping with significant appearance changes and large geometric transformations.

3. Our Approach: PDC-Net
 We introduce PDC-Net, a method for estimating the
dense flow field relating two images, coupled with a robust
pixel-wise confidence map. The latter indicates the relia-
bility and accuracy of the flow prediction, which is nec-
essary for pose estimation, image manipulation, and 3D-
reconstruction tasks.

3.1. Probabilistic Flow Regression

We formulate dense correspondence estimation with a probabilistic model, which provides a framework for learning both the flow and its confidence in a unified formulation. For a given image pair X = (I^q, I^r) of spatial size H × W, the aim of dense correspondence matching is to estimate a flow field Y ∈ R^{H×W×2} relating the reference image I^r to the query I^q. Most learning-based methods address this problem by training a network F with parameters θ that directly predicts the flow field as Y = F(X; θ). However, this does not provide any information about the confidence of the prediction.

Instead of generating a single flow prediction Y, our goal is to learn the conditional probability density p(Y|X; θ) of a flow Y given the input X. This is generally achieved by letting a network predict the parameters Φ(X; θ) of a family of distributions p(Y|X; θ) = p(Y|Φ(X; θ)) = ∏_{ij} p(y_{ij}|ϕ_{ij}(X; θ)). To ensure a tractable estimation, conditional independence of the predictions at different spatial locations (i, j) is generally assumed. We use y_{ij} ∈ R² and ϕ_{ij} ∈ Rⁿ to denote the flow Y and predicted parameters Φ respectively, at location (i, j). In the following, we generally drop the subscript ij to avoid clutter.

Compared to the direct approach Y = F(X; θ), the generated parameters Φ(X; θ) of the predictive distribution can encode richer information about the flow prediction, including its uncertainty. In probabilistic regression techniques for optical flow [23, 43, 12] and a variety of other tasks [67, 25, 55], this is most commonly performed by predicting the variance of the estimate y. In these cases, the predictive density p(y|ϕ) is modeled using Gaussian or Laplace distributions. In the latter case, the density is given by

L(y|\mu, b) = \frac{1}{2b_u} e^{-\frac{|u-\mu_u|}{b_u}} \, \frac{1}{2b_v} e^{-\frac{|v-\mu_v|}{b_v}} , \quad (1)

where the components u and v of the flow vector y = (u, v) ∈ R² are modelled with two conditionally independent Laplace distributions. The mean µ = [µ_u, µ_v]^T ∈ R² and scale b = [b_u, b_v]^T ∈ R² of the distribution p(y|ϕ) = L(y|µ, b) are predicted by the network as (µ, b) = ϕ(X; θ) at every spatial location. We can interpret the mean µ as the flow estimate and the scale b, which is proportional to the standard deviation of (1), as a confidence measure.
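To make the single-Laplace head concrete, here is a minimal PyTorch sketch of the per-pixel negative log-likelihood of (1); this is our own illustration under assumed tensor shapes, not the released PDC-Net code.

```python
import torch

def laplace_nll(flow_gt, mu, log_b):
    """Per-pixel negative log-likelihood of the Laplace density (1).

    flow_gt, mu: (B, 2, H, W) ground-truth and predicted mean flow.
    log_b:       (B, 2, H, W) scale predicted in log-space, so that
                 b = exp(log_b) > 0 needs no explicit constraint.
    Returns a (B, H, W) loss map.
    """
    b = log_b.exp()
    # -log L(y|mu, b) = log(2 b_u) + |u - mu_u| / b_u  +  (same for v)
    nll = torch.log(2.0 * b) + (flow_gt - mu).abs() / b
    return nll.sum(dim=1)  # sum the u and v terms
```

Averaging this map over the image yields the usual negative log-likelihood training objective; the predicted scale b then directly serves as a per-pixel confidence measure.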
Figure 3. Distribution of absolute errors |ŷ − y| on MegaDepth [33] between the flow ŷ estimated by GLU-Net [65] and the ground-truth y. Due to the logarithmic y-axis, Laplace distributions (1) correspond to linear slopes. A single Laplace cannot fit both accurate predictions (red dashed) and large errors (black dashed). Our constrained mixture can model both parts as separate components.

3.2. Constrained Mixture Model Prediction

Fundamentally, the goal of probabilistic deep learning is to achieve a predictive model p(y|X; θ) that coincides with empirical probabilities as well as possible. We can get important insights into this problem by studying the empirical error distribution of a state-of-the-art matching model, in this case GLU-Net [65], as shown in Fig. 3. Current probabilistic methods [23, 43, 12] mostly rely on a Laplacian model (1) of p(y|X; θ). While such a model is able to fit the distribution of small errors up to ∼5 pixels, it fails to capture the more complex behavior of the full error distribution. In particular, the real error distribution has a much heavier tail, consisting of large errors and outliers. For a Laplacian model, these would cause a large loss, forcing the model to trade off accuracy in the small-error regime.

Mixture model: To achieve a flexible model capable of fitting more complex distributions, we parametrize p(y|X; θ) with a mixture model. In general, we consider a distribution consisting of M components,

p(y|\varphi) = \sum_{m=1}^{M} \alpha_m \, L(y|\mu, b_m) . \quad (2)

While we have here chosen Laplacian components (1), any simple density function can be used. The scalars α_m ≥ 0 control the weight of each component, satisfying Σ_{m=1}^{M} α_m = 1. Note that all components have the same mean µ, which can thus be interpreted as the estimated flow vector, but different scales b_m. The employed distribution (2) is therefore still unimodal, but can capture more complex uncertainty patterns. As visualized in Fig. 3, even a combination of two Laplacian components can represent the sharp peak and heavy tail of the GLU-Net error distribution.

Mixture constraints: In general, we now consider a network Φ that, for each pixel location, predicts the mean flow µ along with the scale b_m and weight α_m of each component, as (µ, (α_m)_{m=1}^M, (b_m)_{m=1}^M) = ϕ(X; θ).
Figure 4. Proposed architecture for flow and uncertainty estimation. The correlation uncertainty module U_θ independently processes each 2D-slice C_{ij··} of the correlation volume. Its output is combined with the estimated mean flow µ to predict the weight {α_m}_1^M and scale {b_m}_1^M parameters of our constrained mixture model (2)-(4).

However, a potential issue when predicting the parameters of a mixture model is its permutation invariance. That is, the predicted distribution (2) is unchanged even if we change the order of the individual components. This can cause confusion in the learning, since the network first needs to decide what each component should model before estimating the individual weights α_m and scales b_m. We propose a model that breaks the permutation invariance of the mixture (2), which simplifies the learning and greatly improves the robustness of the estimated uncertainties. In essence, each component m is tasked with modeling a specified range of scales b_m. We achieve this by constraining the mixture (2) as,

0 < \beta_1^- \le b_1 \le \beta_1^+ \le \beta_2^- \le b_2 \le \ldots \le b_M \le \beta_M^+ . \quad (3)

For simplicity, we here assume a single scale parameter b_m for both the u and v directions in (1). The constants β_m^-, β_m^+ specify the range of scales b_m. Intuitively, each component is thus responsible for a different range of uncertainties, roughly corresponding to different regions in the error distribution in Fig. 3. In particular, component m = 1 accounts for the most accurate predictions, while component m = M models the largest errors and outliers. To enforce the constraint (3), we first predict an unconstrained value b̃_m ∈ R, which is then mapped to the given range as,

b_m = \beta_m^- + (\beta_m^+ - \beta_m^-) \, \mathrm{Sigmoid}(\tilde{b}_m) , \quad \tilde{b}_m \in \mathbb{R} . \quad (4)

The constraint values β_m^+, β_m^- can thus either be treated as hyper-parameters or learned end-to-end alongside the other network parameters θ. Lastly, we emphasize an interesting interpretation of our constrained mixture formulation (2)-(3). Note that the predicted weights α_m, in practice obtained through a final SoftMax layer, represent the probabilities of each component m. Our network therefore effectively classifies the flow prediction at each pixel into the separate uncertainty intervals (3). As detailed next, our network learns this ability without any extra supervision.

Training objective: As customary in probabilistic regression [23, 12, 55, 67, 25, 66], we train our method using the negative log-likelihood as the only objective. For one input image pair X = (I^q, I^r) and corresponding ground-truth flow Y, the loss is given by

- \log p\big(Y|\Phi(X; \theta)\big) = - \sum_{ij} \log p\big(y_{ij}|\varphi_{ij}(X; \theta)\big) . \quad (5)

For our approach, the last term is given by the mixture model (2). In the supplementary material, we provide efficient analytic expressions of the objective (5) for our constrained mixture model (2)-(4), that also ensure numerical stability. As detailed in Sec. 4.1, we can train our final model using either a self-supervised strategy where X and Y are generated by artificial warping, using real sparse ground-truth Y, or a combination of both. Next, we present the architecture of our network Φ that predicts the parameters of our constrained mixture (2).
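As a concrete illustration of (2)-(5), the following sketch maps unconstrained network outputs to the constrained scales via (4) and evaluates the mixture negative log-likelihood with a log-sum-exp for numerical stability. It is our own hedged reconstruction; the tensor shapes, the shared u/v scale, and all names are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def constrained_mixture_nll(flow_gt, mu, alpha_logits, b_tilde, beta_min, beta_max):
    """Negative log-likelihood (5) of the constrained Laplace mixture (2)-(4).

    flow_gt, mu:  (B, 2, H, W) ground-truth and predicted mean flow.
    alpha_logits: (B, M, H, W) unnormalized weights (SoftMax -> alpha_m).
    b_tilde:      (B, M, H, W) unconstrained scales, mapped by (4).
    beta_min, beta_max: (M,) per-component constraint values of (3).
    """
    log_alpha = F.log_softmax(alpha_logits, dim=1)                 # log alpha_m
    lo = beta_min.view(1, -1, 1, 1)
    hi = beta_max.view(1, -1, 1, 1)
    b = lo + (hi - lo) * torch.sigmoid(b_tilde)                    # eq. (4)
    # With a shared scale for u and v (one Laplace per direction):
    #   log L(y|mu, b_m) = -2 log(2 b_m) - ||y - mu||_1 / b_m
    l1 = (flow_gt - mu).abs().sum(dim=1, keepdim=True)             # (B, 1, H, W)
    log_comp = -2.0 * torch.log(2.0 * b) - l1 / b                  # (B, M, H, W)
    # log p(y|phi) = logsumexp_m (log alpha_m + log L_m), eq. (2)
    return -torch.logsumexp(log_alpha + log_comp, dim=1)           # (B, H, W)
```

With M = 2 and β₁⁻ = β₁⁺ = 1 fixed, as in our final model (Sec. 4.1), the first component is pinned to accurate predictions while the second absorbs large errors and outliers.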
3.3. Uncertainty Prediction Architecture

Our aim is to predict an uncertainty value that quantifies the reliability of a proposed correspondence or flow vector. Crucially, the uncertainty prediction needs to generalize well to real scenarios, not seen during training. However, this is particularly challenging in the context of self-supervised training, which relies on synthetically warped images or animated data. Specifically, when trained on simple synthetic motion patterns, such as homography transformations, the network learns to heavily rely on global smoothness assumptions, which do not generalize well to more complex settings. It allows the network to learn to confidently interpolate and extrapolate the flow field to regions where no robust match can be found. Due to the significant distribution shift between training and test data, the network thus also infers confident, yet highly erroneous predictions in homogeneous regions on real data. In this section, we address this problem by carefully designing an architecture that greatly limits the risk of the aforementioned issues. Our architecture is visualized in Figure 4.

Current state-of-the-art dense correspondence estimation architectures rely on feature correlation layers. Features f are extracted at resolution h × w from a pair of input images, and densely correlated either globally or within a local neighborhood of size d.

(a) Query image (b) Reference image (c) Common decoder (d) Our decoder (e) Our decoder and data (f) RANSAC-Flow
Figure 5. Visualization of the estimated uncertainties by masking the warped query image to only show the confident flow predictions. The
standard approach (c) uses a common decoder for both flow and uncertainty estimation. It generates overly confident predictions in the sky
and grass. The uncertainty estimates are substantially improved in (d), when using the proposed architecture described in Sec. 3.3. Adding
the flow perturbations for self-supervised training (Sec. 3.4) further improves the robustness and generalization of the uncertainties (e). For
reference, we also visualize the flow and confidence mask (f) predicted by the recent state-of-the-art approach RANSAC-Flow [54].

In the latter case, the output correlation volume is best thought of as a 4D tensor C ∈ R^{h×w×d×d}. Computed as dense scalar products C_{ijkl} = (f^r_{ij})^T f^q_{i+k,j+l}, it encodes the deep feature similarity between a location (i, j) in the reference frame I^r and a displaced location (i + k, j + l) in the query I^q. Standard flow architectures process the correlation volume by first vectorizing the last two dimensions, before applying a sequence of convolutional layers over the reference coordinates (i, j) in order to predict the final flow field.

Correlation uncertainty module: The straightforward strategy for implementing the parameter predictor Φ(X; θ) is to simply increase the number of output channels to include all parameters of the predictive distribution. However, this allows the network to rely primarily on the local neighborhood when estimating the flow and confidence at location (i, j), and thus to ignore the actual reliability of the match and appearance information at the specific location. It results in over-smoothed and overly confident predictions, unable to identify ambiguous and unreliable matching regions, such as the sky. This is visualized in Fig. 5c.

We instead design an architecture that assesses the uncertainty at a specific location (i, j), without relying on neighborhood information. We note that the 2D slice C_{ij··} ∈ R^{d×d} encapsulates rich information about the matching ability of location (i, j), in the form of a confidence map. In particular, it encodes the distinctness, uniqueness, and existence of the correspondence. We therefore create a correlation uncertainty decoder U_θ that independently reasons about each correlation slice as U_θ(C_{ij··}). In contrast to the standard decoders, the convolutions are therefore applied over the displacement dimensions (k, l). Efficient parallel implementation is ensured by moving the first two dimensions of C to the batch dimension using a simple tensor reshape, as sketched below. Our strided convolutional layers then gradually decrease the size d × d of the displacement dimensions (k, l) until a single vector u_{ij} = U_θ(C_{ij··}) ∈ Rⁿ is achieved for each spatial coordinate (i, j) (see Fig. 4).
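The batch-folding reshape can be sketched as follows; this is a minimal PyTorch illustration with assumed layer widths and depth, not the exact PDC-Net module.

```python
import torch
import torch.nn as nn

class CorrelationUncertaintyModule(nn.Module):
    """Maps each d x d correlation slice C_ij.. independently to u_ij in R^n.

    Only the batch-folding reshape follows the text; the channel widths
    and the number of strided layers are our own assumptions.
    """
    def __init__(self, d=9, n=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),   # collapse the remaining (k, l) extent
            nn.Conv2d(32, n, 1),
        )

    def forward(self, corr):
        # corr: (B, h, w, d, d) correlation volume C.
        B, h, w, d, _ = corr.shape
        x = corr.reshape(B * h * w, 1, d, d)  # move (i, j) into the batch dim
        u = self.net(x)                        # (B*h*w, n, 1, 1)
        return u.view(B, h, w, -1)             # one u_ij per spatial location
```

Because the convolutions never see neighboring locations (i, j), the resulting uncertainty features cannot be produced by smoothness-based interpolation.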
Uncertainty predictor: The cost volume does not capture uncertainty arising at motion boundaries, crucial for real data with independently moving objects. We thus additionally integrate predicted flow information in the estimation of its uncertainty. In practice, we concatenate the estimated mean flow µ with the output of the correlation uncertainty module U_θ, and process it with multiple convolution layers. It outputs all parameters of the mixture (2), except for the mean flow µ (see Fig. 4). As shown in Fig. 5d, our uncertainty decoder, comprised of the correlation uncertainty module and the uncertainty predictor, successfully masks out most of the inaccurate and unreliable matching regions.

3.4. Data for Self-supervised Uncertainty

While designing a suitable architecture greatly alleviates the uncertainty generalization issue, the network still tends to rely on global smoothness assumptions and interpolation, especially around object boundaries (see Fig. 5d). While this learned strategy indeed minimizes the negative log-likelihood loss (5) on self-supervised training samples, it does not generalize to real image pairs. In this section, we further tackle this problem from the data perspective in the context of self-supervised learning.

We aim at generating less predictable synthetic motion patterns than simple homography transformations, to prevent the network from primarily relying on interpolation. This forces the network to focus on the appearance of the image region in order to predict its motion and uncertainty. Given a base flow Ỹ relating Ĩ^r to Ĩ^q and representing a simple transformation such as a homography, as in prior works [39, 65, 64, 46], we create a residual flow ε = Σ_i ε_i by adding small local perturbations ε_i. The query image I^q = Ĩ^q is left unchanged while the reference I^r is generated by warping Ĩ^r according to the residual flow ε. The final perturbed flow map Y between I^r and I^q is achieved by composing the base flow Ỹ with the residual flow ε.

An important benefit of introducing the perturbations ε is to teach the network to be uncertain in regions where it cannot identify them. Specifically, in homogeneous regions such as the sky, the perturbations do not change the appearance of the reference (I^r ≈ Ĩ^r) and are therefore unnoticed by the network. However, since the perturbations break the global smoothness of the synthetic flow, the flow errors on those pixels will be higher. In order to decrease the negative log-likelihood loss (5), the network will thus need to estimate a larger uncertainty for these regions. We show the impact of introducing the flow perturbations for self-supervised learning in Fig. 5e.
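One possible construction of the residual flow ε is sketched below, using localized Gaussian bumps as the perturbations ε_i; the bump shape, amplitudes and counts are our own assumptions and need not match the exact data pipeline.

```python
import numpy as np

def perturbation_field(h, w, n_bumps=4, max_amp=8.0, sigma=24.0, rng=None):
    """Residual flow eps = sum_i eps_i from small, spatially local bumps."""
    rng = rng if rng is not None else np.random.default_rng()
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
    eps = np.zeros((h, w, 2), dtype=np.float32)
    for _ in range(n_bumps):
        cy, cx = rng.uniform(0, h), rng.uniform(0, w)
        amp = rng.uniform(-max_amp, max_amp, size=2).astype(np.float32)
        g = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigma ** 2))
        eps += g[..., None] * amp  # localized displacement, decaying with radius
    return eps
```

Warping Ĩ^r by ε and composing ε with the base flow Ỹ then yields the perturbed training pair; for small, smooth perturbations the composed ground-truth flow is close to the sum Ỹ + ε.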

3.5. Geometric Matching Inference

In real settings with extreme view-point changes, flow estimation is prone to failing. Our confidence estimate can be used to improve the robustness of matching networks to such cases. Particularly, our approach offers the opportunity to perform multi-stage flow estimation on challenging image pairs, without any additional network components.

Confidence value: From the predictive distribution p(y|ϕ(X; θ)), we aim at extracting a single confidence value, encoding the reliability of the corresponding predicted flow vector µ. Previous probabilistic regression methods mostly rely on the variance as a confidence measure [23, 12, 67, 25]. However, we observe that the variance can be sensitive to outliers. Instead, we compute the probability P_R of the true flow being within a radius R of the estimated mean flow µ. This is expressed as,

P_R = P(|y - \mu| < R) = \int_{\{y \in \mathbb{R}^2 : |y - \mu| < R\}} p(y|\varphi) \, dy . \quad (6)
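For a Laplace mixture with a shared scale per component, the integral (6), taken over the axis-aligned box ‖y − µ‖∞ < R as in our appendix derivation, has the closed form Σ_m α_m (1 − e^{−R/b_m})². A minimal sketch under our assumed tensor layout:

```python
import torch

def confidence_p_r(alpha, b, radius=1.0):
    """P_R of (6) for a Laplace mixture with shared per-component scale b_m.

    alpha, b: (B, M, H, W) mixture weights and constrained scales.
    Returns a (B, H, W) confidence map in [0, 1].
    """
    # Per component: P(|u - mu_u| < R) * P(|v - mu_v| < R) = (1 - e^{-R/b})^2
    per_comp = (1.0 - torch.exp(-radius / b)) ** 2
    return (alpha * per_comp).sum(dim=1)
```

Thresholding P_R then gives the confident-match masks visualized in Figs. 1 and 5; the radius R and the threshold are hyper-parameters.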
4.1. Implementation Details

We adopt the recent GLU-Net-GOCor [65, 64] as our base architecture. It consists of a four-level pyramidal network operating at two image resolutions and employing a VGG network [6] for feature extraction. At each level, we add our uncertainty decoder and propagate the uncertainty prediction to the subsequent level. We model the probability distribution p(y|ϕ) with a constrained mixture (Sec. 3.2) with M = 2 Laplace components, where the first is fixed to b_1 = β_1^- = β_1^+ = 1 to represent the very accurate predictions, while the second models larger errors and outliers, as 2 = β_2^- ≤ b_2 ≤ β_2^+, where β_2^+ is set to the image size used during training. We experimented with learning the constraints (3), but did not see any noticeable improvement.

Our training consists of two stages. First, we follow the self-supervised training procedure of [65, 64], as illustrated below. Random homography transformations are applied to images compiled from different sources to ensure diversity.
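For reference, the dense base flow Ỹ induced by a homography H can be obtained by pushing the pixel grid through H; a small NumPy sketch (conventions and names are our own):

```python
import numpy as np

def homography_flow(H, h, w):
    """Dense flow of homography H: flow[y, x] = H([x, y]) - [x, y]."""
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
    pts = np.stack([xx, yy, np.ones_like(xx)], axis=-1).reshape(-1, 3)
    warped = pts @ H.T
    warped = warped[:, :2] / warped[:, 2:3]  # dehomogenize
    return warped.reshape(h, w, 2) - np.stack([xx, yy], axis=-1)
```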
Table 1. PCK (%) results on sparse correspondences of the MegaDepth [33] and RobotCar [38, 32] datasets.

                          MegaDepth                  RobotCar
                          PCK-1  PCK-3  PCK-5        PCK-1  PCK-3  PCK-5
SIFT-Flow [36]             8.70  12.19  13.30         1.12   8.13  16.45
NCNet [48]                 1.98  14.47  32.80         0.81   7.13  16.93
DGC-Net [39]               3.55  20.33  32.28         1.19   9.35  20.17
GLU-Net [65]              21.58  52.18  61.78         2.30  17.15  33.87
GLU-Net-GOCor [64]        37.28  61.18  68.08         2.31  17.62  35.18
RANSAC-Flow (MS) [54]     53.47  83.45  86.81         2.10  16.07  31.66
GLU-Net-GOCor*            57.86  78.62  82.30         2.33  17.21  33.67
PDC-Net                   69.78  85.24  86.87         2.47  18.52  35.58
PDC-Net (MS)              70.69  88.13  89.92         2.47  18.43  35.37

Table 2. Optical flow results on the training splits of KITTI [13]. The upper part contains generic dense matching networks, while the bottom part lists specialized optical flow methods.

                          KITTI-2012                 KITTI-2015
                          AEPE ↓  F1 (%) ↓           AEPE ↓  F1 (%) ↓
DGC-Net [39]               8.50    32.28             14.97    50.98
GLU-Net [65]               3.14    19.76              7.49    33.83
GLU-Net-GOCor [64]         2.68    15.43              6.68    27.57
RANSAC-Flow [54]           -       -                 12.48    -
GLU-Net-GOCor*             2.26    10.23              5.58    18.76
PDC-Net                    2.21     8.45              5.66    16.72
PWC-Net [60]               4.14    21.38             10.35    33.7
LiteFlowNet [19]           4.00    -                 10.39    28.5
HD³F [71]                  4.65    -                 13.17    24.0
LiteFlowNet2 [20]          3.42    -                  8.97    25.9
VCN [70]                   -       -                  8.36    25.1
RAFT [61]                  -       -                  5.54    19.8
Lacking ground-truth dense matches in real scenes, we evaluate on standard datasets with sparse ground-truth, namely the RobotCar [38, 32], MegaDepth [33] and ETH3D [53] datasets. Images in RobotCar depict outdoor road scenes, taken under different weather and lighting conditions. While the image pairs show similar view-points, they are particularly challenging due to their numerous textureless regions. MegaDepth images show extreme view-point and appearance variations. Finally, ETH3D depicts indoor and outdoor scenes captured from a moving hand-held camera. For RobotCar and MegaDepth, we evaluate on the correspondences provided by [54], which includes approximately 340M and 367K ground-truth matches respectively. For ETH3D, we follow the protocol of [65], sampling image pairs at different intervals to analyze varying magnitudes of geometric transformations, which results in 600K to 1100K ground-truth matches per interval. In line with [54], we employ the Percentage of Correct Keypoints at a given pixel threshold T (PCK-T) as the evaluation metric.

Results: In Tab. 1 we report results on MegaDepth and RobotCar. Our method PDC-Net outperforms all previous works by a large margin at all PCK thresholds. In particular, our approach is significantly more accurate and robust than the very recent RANSAC-Flow, which utilizes an extensive multi-scale (MS) search. Interestingly, our uncertainty-aware probabilistic approach also outperforms the baseline GLU-Net-GOCor* in pure flow accuracy. This clearly demonstrates the advantages of casting the flow estimation as a probabilistic regression problem, advantages which are not limited to uncertainty estimation: it also substantially benefits the accuracy of the flow itself through a more flexible loss formulation. In Fig. 6, we plot the PCKs on ETH3D. Our approach is consistently better than RANSAC-Flow and GLU-Net-GOCor* for all intervals.

Generalization to optical flow: We additionally show that our approach generalizes well to accurate estimation of optical flow, even though it is trained for the very different task of geometric matching. We use the established KITTI dataset [13], and evaluate according to the standard metrics, namely AEPE and F1. Since we do not fine-tune on KITTI, we show results on the training splits in Tab. 2. Our approach outperforms all previous generic matching methods (upper part) by a large margin in terms of both F1 and AEPE. Surprisingly, PDC-Net also obtains better results than most optical flow methods (bottom part), even outperforming the recent RAFT [61] in the F1 metric on KITTI-2015.

4.3. Uncertainty Estimation

Next, we evaluate our uncertainty estimation. To assess the quality of the uncertainty estimates, we rely on Sparsification Error plots, in line with [24, 68, 1]. The pixels having the highest uncertainty are progressively removed and the AEPE or PCK of the remaining pixels is calculated, which results in the Sparsification curve. The Error curve is then constructed by subtracting the Oracle from the Sparsification curve, where the Oracle is obtained by ranking the pixels according to the ground-truth error instead. As evaluation metric, we use the Area Under the Sparsification Error curve (AUSE); a sketch of this protocol is given below. In Fig. 7, we compare the Sparsification Error plots on MegaDepth of our PDC-Net with other dense methods providing a confidence estimation, namely DGC-Net [39] and RANSAC-Flow [54]. Our probabilistic method PDC-Net produces uncertainty maps that much better correspond to the true errors.

Figure 7. Sparsification Error plots for AEPE (left) and PCK-5 (right) on MegaDepth. Smaller AUSE (in parentheses) is better. AUSE for AEPE: RANSAC-Flow (0.3839), DGC-Net (0.3205), PDC-Net (0.2544); AUSE for PCK-5: RANSAC-Flow (0.0583), DGC-Net (0.0961), PDC-Net (0.0153).
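The sparsification protocol can be sketched as follows; this NumPy illustration, with an assumed number of removal steps, computes the AEPE-based Sparsification Error curve whose area is the AUSE.

```python
import numpy as np

def sparsification_error(err, conf, n_steps=20):
    """Sparsification minus Oracle curve for the AEPE metric.

    err:  (N,) per-pixel endpoint errors.
    conf: (N,) per-pixel confidences (higher = more reliable).
    """
    order_conf = np.argsort(conf)      # least confident pixels first
    order_oracle = np.argsort(-err)    # largest true errors first
    fracs = np.linspace(0.0, 0.99, n_steps)
    curve = []
    for f in fracs:
        k = int(f * len(err))
        aepe_conf = err[order_conf[k:]].mean()      # remove by uncertainty
        aepe_oracle = err[order_oracle[k:]].mean()  # remove by true error
        curve.append(aepe_conf - aepe_oracle)
    return fracs, np.asarray(curve)  # AUSE = np.trapz(curve, fracs)
```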

4.4. Pose and 3D Estimation

Finally, to show the joint performance of our flow and uncertainty prediction, we evaluate our approach for pose estimation. This application has traditionally been dominated by sparse matching methods.

Pose estimation: Given a pair of images showing different view-points of the same scene, two-view geometry estimation aims at recovering their relative pose. We follow the standard set-up of [72] and evaluate on 4 scenes of the YFCC100M dataset [62], each comprising 1000 image pairs. As evaluation metrics, we use mAP for different thresholds on the angular deviation between ground-truth and predicted vectors, for both rotation and translation. Results are presented in Tab. 3. Our approach PDC-Net outperforms the recent D2D [69] and obtains very similar results to RANSAC-Flow, while being 12.2 times faster. With our multi-scale (MS) strategy, PDC-Net outperforms RANSAC-Flow while being 3.6 times faster. Note that RANSAC-Flow employs its own MS strategy, using additional off-the-shelf features, which are exhaustively matched with nearest-neighbor criteria [37]. In comparison, our proposed MS is a simpler, faster and more unified approach. We also note that RANSAC-Flow relies on a semantic segmentation network to better filter unreliable correspondences, in e.g. the sky. Without this segmentation, its performance is drastically reduced. In contrast, our approach can directly estimate highly robust and generalizable confidence maps, without the need for additional network components. The confidence masks of RANSAC-Flow and our approach are visually compared in Fig. 5e-5f.

Table 3. Two-view geometry estimation on YFCC100M [62].

                                  mAP @5°  mAP @10°  mAP @20°  Run-time (s)
SuperPoint [8]                     30.50     50.83     67.85       -
SIFT [37]                          46.83     68.03     80.58       -
D2D [69]                           55.58     66.79      -          -
RANSAC-Flow (MS+SegNet) [54]       64.88     73.31     81.56      9.06
RANSAC-Flow (MS) [54]              31.25     38.76     47.36      8.99
PDC-Net                            63.90     73.00     81.22      0.74
PDC-Net (MS)                       65.18     74.21     82.42      2.55
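To illustrate how confident dense matches feed a standard two-view pipeline, here is a hedged OpenCV sketch based on five-point RANSAC [40, 10]; the confidence threshold, the shared intrinsics, and the absence of any match subsampling are our own simplifications, not the paper's exact protocol.

```python
import cv2
import numpy as np

def relative_pose(flow, p_r, K, thresh=0.5):
    """Estimate (R, t) from a dense flow map and its confidence P_R.

    flow: (H, W, 2) flow mapping reference pixels into the query image;
    p_r:  (H, W) confidence map; K: (3, 3) intrinsics, assumed shared
    by both cameras here for simplicity.
    """
    h, w = p_r.shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = p_r > thresh                              # keep confident matches
    pts_ref = np.stack([xx[mask], yy[mask]], -1).astype(np.float64)
    pts_query = pts_ref + flow[mask].astype(np.float64)
    E, inliers = cv2.findEssentialMat(pts_ref, pts_query, K,
                                      method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_ref, pts_query, K, mask=inliers)
    return R, t
```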
Extension to 3D reconstruction: We also qualitatively show the usability of our approach for dense 3D reconstruction. We compute dense correspondences between day-time images of the city of Aachen from the Visual Localization benchmark [49, 50]. Accurate matches are then selected by thresholding our confidence map, and fed to COLMAP [51] to build a 3D point-cloud. It is visualized in Fig. 2.

4.5. Ablation study

Here, we perform a detailed analysis of our approach in Tab. 4. As baseline, we use a simplified three-level pyramidal network, called BaseNet [65]. All methods are trained using only the first-stage training described in Sec. 4.1.

Table 4. Ablation study. In the top part, different probabilistic models are compared (Sec. 3.1-3.2). In the middle part, a constrained mixture is used, and different architectures for uncertainty estimation are compared (Sec. 3.3). In the bottom part, we analyze the impact of our training data with perturbations (Sec. 3.4).

                                    KITTI-2015               MegaDepth                  YFCC100M
                                    AEPE   F1 (%)  AUSE      PCK-1 (%)  PCK-5 (%)  AUSE   mAP @5°  mAP @10°
BaseNet (L1-loss)                   7.51   37.19   -         20.00      60.00      -      15.58    24.00
Single Laplace                      6.86   34.27   0.220     27.45      62.24      0.210  26.95    37.10
Unconstrained Mixture               6.60   32.54   0.670     30.18      66.24      0.433  31.18    42.55
Constrained Mixture (PDC-Net-s)     6.66   32.32   0.205     32.51      66.50      0.210  33.77    45.17
Common Dec.                         6.41   32.03   0.171     31.93      67.34      0.213  31.13    42.21
Corr. unc. module                   6.32   31.12   0.418     31.97      66.80      0.278  33.95    45.44
Unc. Dec. (Fig. 4) (PDC-Net-s)      6.66   32.32   0.205     32.51      66.50      0.210  33.77    45.17
BaseNet w/o Perturbations           7.21   37.35   -         20.74      59.35      -      15.15    23.88
BaseNet w Perturbations             7.51   37.19   -         20.00      60.00      -      15.58    24.00
PDC-Net-s w/o Perturbations         7.15   35.28   0.256     31.53      65.03      0.219  32.50    43.17
PDC-Net-s w Perturbations           6.66   32.32   0.205     32.51      66.50      0.210  33.77    45.17

Probabilistic model (Tab. 4, top): We first compare BaseNet to our approach PDC-Net-s, modelling the flow distribution with a constrained mixture of Laplace distributions (Sec. 3.2). On both KITTI-2015 and MegaDepth, our approach brings a significant improvement in terms of flow metrics. Note also that performing pose estimation by taking all correspondences (BaseNet) performs very poorly, which demonstrates the need for robust uncertainty estimation. While an unconstrained mixture of Laplace distributions already drastically improves upon the single Laplace component, the permutation invariance of the unconstrained mixture confuses the network, which results in poor uncertainty estimates (high AUSE). Constraining the mixture instead results in better metrics for both the flow and the uncertainty.

Uncertainty decoder architecture (Tab. 4, middle): While the compared uncertainty decoder architectures achieve similar quality in flow prediction, they provide notable differences in uncertainty estimation. Only using the correlation uncertainty module leads to the best results on YFCC100M, since the module efficiently discards unreliable matching regions, in particular compared to the common decoder approach. However, this module alone does not take into account motion boundaries. This leads to poor AUSE on KITTI-2015, which contains independently moving objects. Our final architecture (Fig. 4), additionally integrating the mean flow into the uncertainty estimation, offers the best compromise.

Perturbation data (Tab. 4, bottom): While introducing the perturbations does not help the flow prediction for BaseNet, it provides significant improvements in uncertainty and in flow performance for our PDC-Net-s. This emphasizes that the improvement of the uncertainty estimates originating from introducing the perturbations also leads to improved and more generalizable flow predictions.

5. Conclusion

We propose a probabilistic deep network for estimating the dense image-to-image correspondences and an associated confidence estimate between two images. Specifically, we train the network to predict the parameters of the conditional probability density of the flow, which we model with a constrained mixture of Laplace distributions. Moreover, we introduce an architecture and improved self-supervised training strategy, specifically designed for robust and generalizable uncertainty prediction. Our approach PDC-Net sets a new state-of-the-art on multiple challenging geometric matching and optical flow datasets. It also outperforms previous dense matching methods on pose estimation.

Acknowledgements: This work was partly supported by the ETH Zürich Fund (OK), a Huawei Gift for research, a Huawei Technologies Oy (Finland) project, an Amazon AWS grant, and an Nvidia hardware grant.

References

[1] Oisin Mac Aodha, Ahmad Humayun, M. Pollefeys, and G. Brostow. Learning a confidence measure for optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1107–1120, 2013.
[2] Simon Baker, Daniel Scharstein, J. P. Lewis, Stefan Roth, Michael J. Black, and Richard Szeliski. A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1):1–31, 2011.
[3] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12:43–77, 1994.
[4] Luca Bertinetto, João F. Henriques, Philip H. S. Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. In ICLR, 2019.
[5] Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A. Kalinin. Albumentations: Fast and flexible image augmentations. Information, 11(2), 2020.
[6] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[7] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[8] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In CVPR Workshops, 2018.
[9] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A trainable CNN for joint detection and description of local features. In CVPR, 2019.
[10] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[11] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
[12] Jochen Gast and Stefan Roth. Lightweight probabilistic deep networks. In CVPR, 2018.
[13] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 32(11):1231–1237, 2013.
[14] Yoav HaCohen, Eli Shechtman, Dan B. Goldman, and Dani Lischinski. Non-rigid dense correspondence with applications for image enhancement. ACM Transactions on Graphics, 30(4):70, 2011.
[15] Jared Heinly, Johannes Lutz Schönberger, Enrique Dunn, and Jan-Michael Frahm. Reconstructing the world* in six days *(as captured by the Yahoo 100 million image dataset). In CVPR, 2015.
[16] Berthold K. P. Horn and Brian G. Schunck. "Determining optical flow": A retrospective. Artificial Intelligence, 59(1-2):81–87, 1993.
[17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[18] Shuaiyi Huang, Qiuyue Wang, Songyang Zhang, Shipeng Yan, and Xuming He. Dynamic context correspondence network for semantic alignment. In ICCV, 2019.
[19] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. LiteFlowNet: A lightweight convolutional neural network for optical flow estimation. In CVPR, 2018.
[20] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. A lightweight optical flow CNN - revisiting data fidelity and regularization. 2020.
[21] Junhwa Hur and S. Roth. Optical flow estimation in the deep learning age. arXiv, abs/2004.02853, 2020.
[22] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. DSLR-quality photos on mobile devices with deep convolutional networks. In ICCV, 2017.
[23] Eddy Ilg, Özgün Çiçek, Silvio Galesso, Aaron Klein, Osama Makansi, Frank Hutter, and Thomas Brox. Uncertainty estimates and multi-hypotheses networks for optical flow. In ECCV, 2018.
[24] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
[25] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS, 2017.
[26] Seungryong Kim, Dongbo Min, Somi Jeong, Sunok Kim, Sangryul Jeon, and Kwanghoon Sohn. Semantic attribute matching networks. In CVPR, 2019.
[27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[28] Claudia Kondermann, Daniel Kondermann, Bernd Jähne, and Christoph S. Garbe. An adaptive confidence measure for optical flows based on linear subspace projections. In DAGM Symposium, 2007.
[29] Claudia Kondermann, Rudolf Mester, and Christoph S. Garbe. A statistical confidence measure for optical flows. In ECCV, 2008.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[31] Jan Kybic and Claudia Nieuwenhuis. Bootstrap optical flow confidence and uncertainty measure. Computer Vision and Image Understanding, 115(10):1449–1462, 2011.
[32] Måns Larsson, Erik Stenborg, Lars Hammarstrand, Marc Pollefeys, Torsten Sattler, and Fredrik Kahl. A cross-season correspondence dataset for robust semantic segmentation. In CVPR, 2019.
[33] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In CVPR, 2018.
[34] Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. Visual attribute transfer through deep image analogy. ACM Transactions on Graphics, 36(4), 2017.
[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[36] Ce Liu, Jenny Yuen, and Antonio Torralba. SIFT flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):978–994, 2011.
[37] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[38] Will Maddern, Geoff Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The Oxford RobotCar dataset. International Journal of Robotics Research, 36(1):3–15, 2017.
[39] Iaroslav Melekhov, Aleksei Tiulpin, Torsten Sattler, Marc Pollefeys, Esa Rahtu, and Juho Kannala. DGC-Net: Dense geometric correspondence network. In WACV, 2019.
[40] David Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–777, 2004.
[41] David Novotny, Samuel Albanie, Diane Larlus, and Andrea Vedaldi. Self-supervised learning of geometrically stable features through probabilistic introspection. In CVPR, 2018.
[42] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
[43] Jianing Qian, Junyu Nan, Siddharth Ancha, Brian Okorn, and David Held. Robust instance tracking via uncertainty flow. CoRR, abs/2010.04367, 2020.
[44] Jerome Revaud, Philippe Weinzaepfel, César Roberto de Souza, and Martin Humenberger. R2D2: Repeatable and reliable detector and descriptor. In NeurIPS, 2019.
[45] Jérôme Revaud, Philippe Weinzaepfel, Zaïd Harchaoui, and Cordelia Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015.
[46] Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. Convolutional neural network architecture for geometric matching. In CVPR, 2017.
[47] I. Rocco, R. Arandjelović, and J. Sivic. Efficient neighbourhood consensus networks via submanifold sparse convolutions. In ECCV, 2020.
[48] Ignacio Rocco, Mircea Cimpoi, Relja Arandjelovic, Akihiko Torii, Tomás Pajdla, and Josef Sivic. Neighbourhood consensus networks. In NeurIPS, 2018.
[49] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomás Pajdla. Benchmarking 6DOF outdoor visual localization in changing conditions. In CVPR, 2018.
[50] Torsten Sattler, Tobias Weyand, Bastian Leibe, and Leif Kobbelt. Image retrieval for image-based localization revisited. In BMVC, 2012.
[51] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
[52] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
[53] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017.
[54] Xi Shen, François Darmon, Alexei A. Efros, and Mathieu Aubry. RANSAC-Flow: Generic two-stage image alignment. In ECCV, 2020.
[55] Yichen Shen, Zhilu Zhang, Mert R. Sabuncu, and Lin Sun. Learning the distribution: A unified distillation paradigm for fast uncertainty estimation in computer vision. CoRR, abs/2007.15857, 2020.
[56] Abhinav Shrivastava, Tomasz Malisiewicz, Abhinav Gupta, and Alexei A. Efros. Data-driven visual similarity for cross-domain image matching. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 30(6), 2011.
[57] Patrice Y. Simard, Dave Steinkraus, and John C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, 2003.
[58] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[59] Deqing Sun, Stefan Roth, and Michael J. Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision, 106(2):115–137, 2014.
[60] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
[61] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.
[62] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[63] Prune Truong, Stefanos Apostolopoulos, Agata Mosinska, Samuel Stucky, Carlos Ciller, and Sandro De Zanet. GLAMpoints: Greedily learned accurate match points. In ICCV, 2019.
[64] Prune Truong, Martin Danelljan, Luc Van Gool, and Radu Timofte. GOCor: Bringing globally optimized correspondence volumes into your neural network. In NeurIPS, 2020.
[65] Prune Truong, Martin Danelljan, and Radu Timofte. GLU-Net: Global-local universal network for dense flow and correspondences. In CVPR, 2020.
[66] Ali Varamesh and Tinne Tuytelaars. Mixture dense regression for object detection and human pose estimation. In CVPR, 2020.
[67] Stefanie Walz, Tobias Gruber, Werner Ritter, and Klaus Dietmayer. Uncertainty depth estimation with gated images for 3D reconstruction. CoRR, abs/2003.05122, 2020.
[68] Anne S. Wannenwetsch, Margret Keuper, and Stefan Roth. ProbFlow: Joint optical flow and uncertainty estimation. In ICCV, 2017.
[69] Olivia Wiles, Sébastien Ehrhardt, and Andrew Zisserman. D2D: Learning to find good correspondences for image matching and manipulation. CoRR, abs/2007.08480, 2020.
[70] Gengshan Yang and Deva Ramanan. Volumetric correspondence networks for optical flow. In NeurIPS, 2019.
[71] Zhichao Yin, Trevor Darrell, and Fisher Yu. Hierarchical discrete distribution decomposition for match density estimation. In CVPR, 2019.
[72] Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Hongen Liao, and Long Quan. Learning two-view correspondences and geometry using order-aware network. In ICCV, 2019.
[73] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127(3):302–321, 2019.
Appendix

In this supplementary material, we first give a detailed derivation of our probabilistic model as a constrained mixture of Laplace distributions in Sec. A. In Sec. B, we then derive our probabilistic training loss and explain our training procedure in more depth. We subsequently provide additional information about the architecture of our proposed networks as well as implementation details in Sec. C. In Sec. D, we extensively explain the evaluation datasets and set-up. Then, we present more detailed quantitative and qualitative results in Sec. E. In particular, we show quantitative results on image-based localization on the Aachen dataset, corresponding to the visual 3D reconstruction results shown in the main paper. Finally, we perform detailed ablative experiments in Sec. F.

Figure 8. Visualization of the mixture parameters (α_m)_{m=1}^M and b_2 predicted by our final network PDC-Net, on multiple image pairs from (A) MegaDepth and (B) KITTI-2015. PDC-Net has M = 2 Laplace components and here, we do not represent the scale parameter b_1, since it is fixed as b_1 = 1.0. We also show the resulting confidence maps P_R for multiple R.

A. Detailed derivation of probabilistic model

Here we provide the details of the derivation of our uncertainty estimate.

Probabilistic formulation: We model the flow estimation as a probabilistic regression with a constrained mixture density of Laplacians (Sec. 3.2 of the main paper). Our mixture density, corresponding to equation (2) of the main paper, is expressed as

p(y|\varphi) = \sum_{m=1}^{M} \alpha_m \, L(y|\mu, b_m) , \quad (7)

where, for each component m, the bi-variate Laplace distribution L(y|µ, b_m) is computed as the product of two independent uni-variate Laplace distributions,

L(y|\mu, b_m) = L(u, v|\mu_u, \mu_v, b_u, b_v) \quad (8a)
             = L(u|\mu_u, b_u) \cdot L(v|\mu_v, b_v) \quad (8b)
             = \frac{1}{2b_u} e^{-\frac{|u-\mu_u|}{b_u}} \, \frac{1}{2b_v} e^{-\frac{|v-\mu_v|}{b_v}} , \quad (8c)

where µ = [µ_u, µ_v]^T ∈ R² and b_m = [b_u, b_v]^T ∈ R² are respectively the mean and scale parameters of the distribution L(y|µ, b_m). In this work, we additionally define equal scales in both flow directions, such that b_m = b_u = b_v ∈ R. As a result, equation (8) simplifies, and when inserting into (7), we obtain

p(y|\varphi) = \sum_{m=1}^{M} \alpha_m \left( \frac{1}{2 b_m} \right)^{2} e^{-\frac{\|y-\mu\|_1}{b_m}} . \quad (9)

Confidence value: The network thus predicts the parameters (µ, (α_m)_{m=1}^M, (b_m)_{m=1}^M) = ϕ(X; θ). However, we aim at obtaining a single confidence value to represent the reliability of the estimated flow vector µ. As a final confidence measure, we thus compute the probability P_R of the true flow being within a radius R of the estimated mean flow vector µ. This is expressed as

P_R = P(\|y - \mu\|_\infty < R) \quad (10a)
    = \int_{\{y \in \mathbb{R}^2 : \|y - \mu\|_\infty < R\}} p(y|\varphi) \, dy . \quad (10b)
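For completeness, the box integral (10) evaluates in closed form under the equal-scale density (9), since each Laplace factor integrates independently over [−R, R]; the following elementary derivation is ours.

```latex
P_R = \sum_{m=1}^{M} \alpha_m
      \left( \int_{-R}^{R} \frac{1}{2 b_m}\, e^{-\frac{|t|}{b_m}} \, dt \right)^{2}
    = \sum_{m=1}^{M} \alpha_m \left( 1 - e^{-R/b_m} \right)^{2} .
```

P_R therefore depends only on the predicted weights α_m and scales b_m, and can be evaluated per pixel at negligible cost.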