Geophys. J. Int. (2021) 227, 941–968  https://doi.org/10.1093/gji/ggab270
Advance Access publication 2021 July 15
GJI Seismology

Autotuning Hamiltonian Monte Carlo for efficient generalized
nullspace exploration

Andreas Fichtner, Andrea Zunino, Lars Gebraad and Christian Boehm
Department of Earth Sciences, ETH Zurich, Sonneggstrasse 5, 8092 Zurich, Switzerland. E-mail: andreas.fichtner@erdw.ethz.ch

Accepted 2021 July 13. Received 2021 May 7; in original form 2020 December 14

SUMMARY
We propose methods to efficiently explore the generalized nullspace of (non-linear) inverse problems, defined as the set of plausible models that explain observations within some misfit tolerance. Owing to the random nature of observational errors, the generalized nullspace is an inherently probabilistic entity, described by a joint probability density of tolerance values and model parameters. Our exploration methods rest on the construction of artificial Hamiltonian systems, where models are treated as high-dimensional particles moving along a trajectory through model space. In the special case where the distribution of misfit tolerances is Gaussian, the methods are identical to standard Hamiltonian Monte Carlo, revealing that its apparently meaningless momentum variable plays the intuitive role of a directional tolerance. Its direction points from the current towards a new acceptable model, and its magnitude is the corresponding misfit increase. We address the fundamental problem of producing independent plausible models within a high-dimensional generalized nullspace by autotuning the mass matrix of the Hamiltonian system. The approach rests on a factorized and sequentially preconditioned version of the L-BFGS method, which produces local Hessian approximations for use as a near-optimal mass matrix. An adaptive time stepping algorithm for the numerical solution of Hamilton's equations ensures both stability and reasonable acceptance rates of the generalized nullspace sampler. In addition to the basic method, we propose variations of it, where autotuning focuses either on the diagonal elements of the mass matrix or on the macroscopic (long-range) properties of the generalized nullspace distribution. We quantify the performance of our methods in a series of numerical experiments, involving analytical, high-dimensional, multimodal test functions. These are designed to mimic realistic inverse problems, where sensitivity to different model parameters varies widely, and where parameters tend to be correlated. The tests indicate that the effective sample size may increase by orders of magnitude when autotuning is used. Finally, we present a proof of principle of generalized nullspace exploration in viscoelastic full-waveform inversion. In this context, we demonstrate (1) the quantification of inter- and intraparameter trade-offs, (2) the flexibility to change model parametrization a posteriori, for instance, to adapt averaging length scales, (3) the ability to perform dehomogenization to retrieve plausible subwavelength models and (4) the extraction of a manageable number of alternative models, potentially located in distinct local minima of the misfit functional.

Key words: Inverse theory; Numerical solutions; Probability distributions; Statistical methods; Seismic tomography.

1 INTRODUCTION

Our knowledge about the internal structure of bodies that are inaccessible to direct observation, such as the Earth or the human body, derives from the solution of inverse problems, which assimilate data to constrain the parameters m of some forward modelling equations. Imperfections of these data combined with inherent (physical) non-uniqueness and unavoidable simplifications of the equations render the solution of any inverse problem ambiguous. Actually solving an inverse problem therefore requires us to describe the 'very infinite-dimensional' manifold of 'acceptable models' (Backus & Gilbert 1968), that is, models with a misfit χ(m) below some threshold.


1.1 Characterizing acceptable solution- and nullspace

Trying to tame the intimidating infinite-dimensionality of the solution space, Backus & Gilbert themselves formalized a series of approaches that were to be used for decades to come. These include the linearization of the forward modelling equations, an expansion of plausible solutions into a finite number of orthogonal basis functions, the computation of parameter averages using optimally δ-like averaging kernels and the solution of constrained least-squares problems (Backus & Gilbert 1967, 1968, 1970). For the special case of linear problems, Wiggins (1972) analysed that part of model space for which error-contaminated data provide only weak constraints. He then proposed to construct what would today be called the generalized nullspace using singular-value decomposition. Modern variants of Wiggins' concept, adapted to higher-dimensional model- and nullspaces, can be found in Deal & Nolet (1996) and de Wit et al. (2012) for linear problems, and in Liu & Peter (2020) for problems that can be linearized reasonably well.

Efforts, such as the one of Kennett (1978), to characterize the space of acceptable solutions (or, equivalently, the generalized nullspace) for non-linear inverse problems have remained few in number. Instead, increasing computational power enabled Monte Carlo methods, adapted from statistical physics (Metropolis et al. 1953; Metropolis 1987), to provide finite sets of acceptable models by brute-force forward modelling and testing against data (e.g. Keilis-Borok & Yanovskaya 1967; Press 1968, 1970).

Concerns about how to quantitatively digest the potentially large ensemble of acceptable models produced by Monte Carlo sampling (e.g. Anderssen & Seneta 1971; Kennett & Nolet 1978) were dispelled by the realization that the samples may be used to properly sample the posterior probability density ρ(m) ∝ e^{−χ(m)}, which in turn could be related to a rigorous application of Bayes' theorem (Mosegaard & Tarantola 1995). From the samples one may select some that fall within the generalized nullspace, or one may compute lower-dimensional quantities, such as means, marginal distributions or (higher-order) moments.

What followed was the development of numerous Monte Carlo variants that go beyond the classic Metropolis–Hastings algorithm (Hastings 1970) in trying to adapt to the particularities of high-dimensional, non-linear inverse problems. These methods include, but are not limited to, parallel tempering (e.g. Geyer & Thompson 1995; Sambridge 2014), the Neighbourhood Algorithm (Sambridge 1999a,b), the reversible-jump algorithm used for transdimensional inversion (e.g. Green 1995; Sambridge et al. 2006, 2013), Hamiltonian Monte Carlo (HMC, e.g. Duane et al. 1987; Sen & Biswas 2017; Fichtner et al. 2019) or the Metropolis-adjusted Langevin algorithm (MALA, e.g. Roberts & Tweedie 1996; Izzatullah et al. 2021).

1.2 Challenges and desiderata

Despite undeniable progress, challenges remain. Arguably the most important among these is the efficient computation of acceptable models that are independent, that is, significantly different from each other. As the model space dimension N_m grows, the probability of completely randomly drawing a model m that happens to be acceptable decreases superexponentially (e.g. Tarantola 2005; Fichtner 2021). Increasing the acceptance rate of trial models may require very small steps from the current model to a new candidate model, which leads to both an explosion of computational cost and slow convergence of the sample chain (e.g. Geyer 1992; Kass et al. 1998; MacKay 2003; Gelman et al. 2013).

In addition to providing independent models at minimal cost, a generalized nullspace sampler should be adaptable to the needs of a specific application. In particular, taking a data-oriented perspective, it should enable a flexible notion of what makes a model acceptable. If, for instance, a currently available model m0 produces a rather large misfit χ(m0), other models may be required to have a generally lower misfit in order to be acceptable. Alternatively, models may be acceptable when their associated misfits fall within a range controlled by the observational error statistics.

From a model-oriented perspective, a generalized nullspace sampler should have the flexibility to preferentially explore models with predefined and application-specific characteristics. We may, for example, be interested in alternative models that contain more small-scale structure or are smoother than our current model m0. Similarly, in the context of quantitative hypothesis testing, we may want to find alternative models that contain specific new features to an extent that is compatible with previously assimilated data.

1.3 Outline

Based on these challenges and desiderata, this manuscript is organized as follows. In Section 2, we define the generalized nullspace in terms of a misfit tolerance, which, by virtue of random observational errors, is a probabilistic quantity. Subsequently, we demonstrate that the generalized nullspace can be explored using a mechanical analogue, whereby a model is treated as a particle on a trajectory controlled by Hamilton's equations. The classical HMC algorithm (e.g. Duane et al. 1987; Neal 2011) emerges from this analysis as the special case where the misfit tolerances follow a chi-squared distribution. Via a series of examples we will see that the efficiency of the generalized nullspace sampler critically depends on its tuning, and specifically on the artificial mass matrix of the particle.

Within this context, Section 3 proposes an autotuning mechanism. This involves (1) an on-the-fly quasi-Newton approximation of the misfit Hessian, which serves as a near-optimal mass matrix, and (2) an adaptive time-stepping approach that ensures stability of the numerical solution of Hamilton's equations as the mass matrix changes.

Section 4 is dedicated to a performance assessment of the proposed autotuning method and some of its variants. For this, we consider high-dimensional and strongly multimodal analytical test functions with significant parameter correlations that are designed to mimic misfit surfaces that one may encounter in realistic inverse problems. In these examples, autotuning helps to reduce the number of samples needed to achieve convergence by more than one order of magnitude.

Encouraged by these results, Section 5 presents a generalized nullspace exploration for 1-D viscoelastic full-waveform inversion, which enables, for example, the detection of different misfit minima. The ability to treat high-dimensional model spaces allows us to parametrize the model at subwavelength scale, and to choose some spatial parameter averaging a posteriori, for instance, as a function of the desired certainty. Furthermore, we propose an algorithm that extracts a manageable number of acceptable models that are at a predefined minimum distance from each other.

Finally, in Section 6, we discuss, among other aspects, the relation of our method to (1) previous work in Hessian-aware Monte Carlo sampling, (2) dehomogenization and the construction of alternative small-scale models and (3) non-linear full-waveform inversion.
2 RANDOMIZED NULLSPACE EXPLORATION

We begin with the notion of a generalized nullspace. For this we assume the existence of an estimated plausible model m0 with misfit χ0 = χ(m0), which approximately minimizes the misfit functional χ. The estimate may have been found using gradient-based (e.g. Nocedal & Wright 1999) or stochastic (e.g. Sen & Stoffa 2013; Fichtner 2021) methods, or it may represent a priori knowledge from previous analyses. Due to observational uncertainties, forward modelling errors and inherent non-uniqueness, alternative models, m0 + Δm, are still plausible when the associated misfit increase remains below some tolerance ε ≥ 0, that is,

χ(m0 + Δm) ≤ χ0 + ε.   (1)

The ensemble of tolerable models m0 + Δm constitutes the generalized nullspace (Deal & Nolet 1996).

2.1 Hamiltonian nullspace shuttle

For a given tolerance ε, generalized nullspace exploration can be achieved through the interpretation of a model m as the N_m-dimensional position vector of an imaginary particle, also referred to as the nullspace shuttle (Fichtner & Zunino 2019; Fichtner 2021). The position of the particle varies as a function of an artificially introduced time τ, meaning that different τ correspond to different members of model space, m(τ). To determine the movement of the particle through model space, we construct artificial equations of motion, borrowing concepts from classical mechanics (e.g. Symon 1971; Landau & Lifshitz 1976). First, we equate the (not necessarily positive definite) misfit χ(m) with an artificial potential energy of the particle,

U(m) = χ(m).   (2)

The potential energy, most intuitively imagined as a gravitational energy, induces a force −∇U[m(τ)], parallel to the direction of steepest descent. Hence, within some time increment δτ, the potential energy acts to move m(τ) towards a new model m(τ + δτ) with lower misfit. The 'gravitational' force parallel to −∇U[m(τ)] is complemented by an inertial force, related to an artificial momentum p(τ), which also has dimension N_m. Together with an equally artificial, symmetric and positive-definite mass matrix M, the momentum defines the kinetic energy

K(p) = (1/2) p^T M^{-1} p.   (3)

The sum of potential and kinetic energies, that is, the total energy of the artificial mechanical system, is the Hamiltonian H(m, p) = U(m) + K(p). In terms of H, the trajectory of the N_m-dimensional particle is fully determined by Hamilton's equations

dm_i/dτ = ∂H/∂p_i,   dp_i/dτ = −∂H/∂m_i,   i = 1, ..., N_m.   (4)

Along any trajectory in phase (model-momentum) space, H is preserved. Hence, starting at some approximate minimum m0 of χ and some initial momentum p0, the solution of eq. (4) leads to a continuous sequence of models m(τ) and momenta p(τ) that satisfy

H[m(τ), p(τ)] = χ[m(τ)] + (1/2) p(τ)^T M^{-1} p(τ) = H(m0, p0) = χ0 + (1/2) p0^T M^{-1} p0.   (5)

When p0 is chosen such that

K(p0) = (1/2) p0^T M^{-1} p0 = ε,   (6)

eq. (5) implies

χ[m(τ)] ≤ χ(m0) + ε,   (7)

because the positive definiteness of the mass matrix M ensures p(τ)^T M^{-1} p(τ) > 0 for all momenta p(τ). Consequently, all models m(τ) along the trajectory are within the generalized nullspace.

While the Hamiltonian system constructed for nullspace exploration seems artificial, eq. (6) injects concrete physical meaning into the momentum variable p. In fact, p0 plays the role of an initial directional tolerance. Its M^{-1}-norm ||p0||_M^2 = (1/2) p0^T M^{-1} p0 determines the maximum admissible misfit increase, and its direction controls the initial direction in model space along which alternative models are sought. In fact, as shown in Fichtner & Zunino (2019), the model perturbation applied by the nullspace shuttle during the initial and infinitesimally short part of its trajectory is proportional to M^{-1} p0. Hence, p0 may be used to insert specific features into alternative models. The mass matrix may then modify these features, making them, for example, rougher or smoother. Hamilton's eqs (4) govern the change of the directional tolerance, from the initial p0 to some p(τ).

2.2 The probabilistic generalized nullspace

Random errors in the observed data vector d_obs cause the generalized nullspace to be an inherently probabilistic entity. The repetition of the experiment, in reality or hypothetically, would yield a different realization of d_obs and a different misfit χ0. Hence, for a given m0 the distribution of misfits is characterized by a probability density ρ(χ|m0). The random nature of χ translates to the tolerance ε. Had we, for instance, obtained a smaller misfit for m0 by chance, we would possibly accept a larger tolerance, and vice versa. Therefore, we may equally describe the distribution of ε by a probability density ρ(ε|m0).

The directional tolerance p0 inherits the probabilistic character of the scalar tolerance ε, but its distribution ρ(p0|m0) is not solely controlled by the misfit statistics. In fact, considering eq. (6), we may obtain some p0 for a specific realization of ε by (1) drawing a vector q from an arbitrary probability distribution, (2) rescaling q such that q^T q = 2ε and (3) setting p0 = Sq, where M = SS^T is a factorization of the mass matrix. Clearly, the vector p0, intuitively interpretable as the initial take-off direction of the nullspace shuttle, depends on the mass matrix M, which we are free to choose, as long as it is symmetric and positive definite.

As schematically illustrated in Fig. 1, the design of M can be used to introduce additional information or desiderata about the direction along which alternative models should be found. This may include average properties of the take-off directions, subjective preferences, or the need to incorporate new, independent information into a model without deteriorating the fit to previously included data. The precise meaning of the generalized nullspace depends on how we construct M and therefore ρ(p0|m0). To avoid overly abstract developments, we will present application-specific examples of ρ(p0|m0) throughout the following sections. An expanded collection of possible tolerance distributions, including the special case of zero tolerance, can be found in Appendix A.
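The three-step recipe above translates directly into code. The following minimal sketch (Python/NumPy, written for this text and not part of the published material; the factor S is an illustrative placeholder) draws a directional tolerance p0 for a given realization of ε:

import numpy as np

def draw_directional_tolerance(eps, S, rng):
    # Step (1): draw q from an arbitrary distribution (here standard normal).
    q = rng.standard_normal(S.shape[0])
    # Step (2): rescale q such that q^T q = 2*eps.
    q *= np.sqrt(2.0 * eps) / np.linalg.norm(q)
    # Step (3): map through the mass matrix factor, p0 = S q, with M = S S^T.
    return S @ q

# Illustrative usage with a hypothetical diagonal mass matrix factor:
rng = np.random.default_rng(0)
S = np.diag(np.sqrt([0.01, 0.1, 1.0]))
p0 = draw_directional_tolerance(eps=0.5, S=S, rng=rng)

By construction, K(p0) = (1/2) p0^T M^{-1} p0 = (1/2) q^T q = ε for invertible S, so a shuttle launched with this p0 satisfies eq. (6) exactly, while the choice of S biases the take-off direction in the sense of Fig. 1.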
Figure 1. Schematic illustration of the probability distribution for directional tolerances ρ(p0|m0) as a function of the mass matrix M. The radius of the pale dashed circle equals a specific realization of the scalar tolerance ε, and the blue arrow marks a specific p0, which can be interpreted as the initial take-off direction of the nullspace shuttle. (a) When M = I, the distribution of initial momenta is isotropic. (b) For M ≠ I, certain directions will be preferred, meaning that the distribution is statistically anisotropic. (c) The mass matrix M may also be designed to favour specific directions that are in accord with personal preference or new independent information.

The product of the tolerance distribution ρ(p|m) and the model space posterior ρ(m),

ρ(p, m) = ρ(p|m) ρ(m),   (8)

defines a joint probability density in tolerance-model space. This generalized nullspace distribution describes the combined information on alternative models and misfit tolerances that we are willing to accept. For a fixed directional tolerance p, the joint distribution ρ(p, m) provides the likelihoods of acceptable models. Conversely, for a fixed model m, it gives the likelihood of accepting a certain misfit increase. Finally, integrating (marginalizing) ρ(p, m) over the tolerances p returns the model space posterior ρ(m).

2.3 Sampling the generalized nullspace distribution

For problems of sufficiently low dimension, the complete joint distribution ρ(p, m) may be explored by brute-force grid search. However, when the model space dimension is high, that is, typically above a few tens or hundreds, we need to limit ourselves to the Monte Carlo approximation of lower-dimensional quantities. These may include moments of the distribution (means, variances, ...), marginal probability densities, or other lower-dimensional characteristics of the posterior.

As proven in Appendix B, the Hamiltonian nullspace exploration described in Section 2.1 provides a mechanism for the Monte Carlo sampling of ρ(p, m), which we summarize in the following algorithm (a minimal code sketch is given in Section 2.4.1 below):

(1) Starting from m0, randomly draw a directional tolerance p0 from ρ(p|m0).
(2) Propagate (p0, m0) for some time T along a Hamiltonian trajectory towards the test momentum/model (p(T), m(T)).
(3) Accept (p(T), m(T)) with probability min[1, ρ(p(T), m(T)) / ρ(p0, m0)].

In case of acceptance, set m(T) → m1 and repeat the procedure by drawing p1 according to step (1). Otherwise, continue with m0 and retry the procedure, as before. The resulting Markov chain, (p0, m0), (p1, m1), ..., has ρ(p, m) as equilibrium distribution, meaning that the sampling density is proportional to ρ(p, m).

The most noteworthy special case of this algorithm is HMC (e.g. Duane et al. 1987; Neal 2011; Betancourt 2017; Fichtner 2021), which recently gained attention in geophysics for its ability to solve inverse problems of comparatively high dimension (e.g. Sen & Biswas 2017; Fichtner & Simutė 2018; Fichtner et al. 2019; Gebraad et al. 2020; Kotsi et al. 2020; Muir & Tkalčić 2020). In HMC, scalar tolerances ε are drawn from a chi-squared distribution with n degrees of freedom, independent of the current model m_i. This means that acceptable misfit increases scale with model space dimension (Appendix A3). The corresponding distribution of the directional tolerances p_i is the N_m-dimensional Gaussian with covariance matrix M, also independent of the current position m_i in model space. A conceptual difference between Hamiltonian nullspace sampling and HMC is the initial model m0. While nullspace sampling assumes that m0 is already an acceptable model, m0 is drawn randomly in HMC. After sufficiently many samples, the influence of the initial model will diminish, and so this difference disappears asymptotically. Yet, as noted by Geyer (2011), Monte Carlo methods in general benefit from choosing an acceptable m0, as this may eliminate or at least shorten the burn-in phase, which is otherwise needed to approach the typical set.

In many applications, Hamilton's equations cannot be integrated analytically, meaning that numerical integrators must be used to obtain approximate solutions. While the numerical approximation may affect the conservation of energy, the sampling algorithm presented above remains valid as long as the numerical integrator is symplectic, that is, time-reversible and volume-preserving (see Appendix B).

2.4 Examples

2.4.1 The 1-D harmonic oscillator

For the purpose of illustration, we begin with the simple example of inferring the circular frequency m of a 1-D harmonic oscillator from observations of its amplitude

u(t) = 1.2 sin(mt),   (9)

at a few irregularly spaced observation times t_1, ..., t_{N_d}, as illustrated in Fig. 2(a). Problems of this kind appear, for instance, in Doppler spectroscopy for the detection of exoplanets (e.g. Struve 1952; Mayor & Queloz 1995), and in the estimation of stellar oscillation periods (e.g. Dworetsky 1983; Bourguignon et al. 2006).
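The sampling loop of Section 2.3 can be sketched as follows (Python/NumPy; an illustration written for this text, not the authors' code). It combines steps (1)–(3) with a leapfrog integrator (cf. Appendix C) and assumes the HMC special case, in which ρ(p, m) ∝ e^{−H(p,m)}, so that the acceptance probability of step (3) reduces to min[1, exp(H0 − H(T))]. The functions U, grad_U and draw_p are user-supplied, and dt plays the role of the time step Δτ:

import numpy as np

def leapfrog(m, p, grad_U, Minv, dt, n_steps):
    # Symplectic numerical approximation of Hamilton's eqs (4).
    m, p = m.copy(), p.copy()
    p -= 0.5 * dt * grad_U(m)                 # initial half-step in momentum
    for _ in range(n_steps - 1):
        m += dt * (Minv @ p)                  # full position step
        p -= dt * grad_U(m)                   # full momentum step
    m += dt * (Minv @ p)
    p -= 0.5 * dt * grad_U(m)                 # final half-step in momentum
    return m, p

def sample_nullspace(m0, U, grad_U, draw_p, Minv, dt, n_steps, n_samples, rng):
    # Markov chain with equilibrium distribution rho(p, m), Section 2.3.
    m, samples = m0.copy(), []
    for _ in range(n_samples):
        p = draw_p(m)                                          # step (1)
        H0 = U(m) + 0.5 * p @ (Minv @ p)
        m_T, p_T = leapfrog(m, p, grad_U, Minv, dt, n_steps)   # step (2)
        H_T = U(m_T) + 0.5 * p_T @ (Minv @ p_T)
        if rng.random() < np.exp(min(0.0, H0 - H_T)):          # step (3)
            m = m_T
        samples.append(m.copy())
    return np.array(samples)

For the harmonic oscillator of this subsection, m is one-dimensional, M = 1, and draw_p returns ±√(2Mε) with ε drawn from the tolerance distribution introduced below; for the Gaussian example of Section 2.4.2, draw_p simply returns a standard normal vector.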
Figure 2. 1-D harmonic oscillator. (a) Amplitude of an oscillator with circular frequency m_true = 1. Randomly perturbed (noisy) amplitude observations are shown as blue dots, and an estimated model with circular frequency m0 = 1.1 as grey curve. (b) A random collection of Hamiltonian trajectories drawn from the tolerance distribution χ(ε|m0) = ρ(χ0 + ε|m0) + kδ(ε) is shown in the top panel. Colour coding corresponds to the actual tolerance value ε, with larger values plotted in more intense tones of red. The ith trajectory follows an iso-curve of the Hamiltonian H = U + K and samples misfit values below U(m_i) + ε_i, as plotted in identical colours in the lower panel. For better visibility, potential energies (misfits) of individual trajectories are slightly offset from the complete misfit curve, shown in grey. The potential energy of the initial model m0 = 1.1 is indicated by a grey line. (c) Joint distribution ρ(p, m) = ρ(p|m)ρ(m) with exponential colour scale. Models with low misfit, for instance around m = 1.0, admit larger tolerances, and vice versa.

We assume that the elements of the observed data vector d_obs,i = u_obs(t_i) are independently polluted by noise with a standard normal distribution, justifying the use of the root-mean-square misfit

χ(m) = √( Σ_{i=1}^{N_d} [d_i(m) − d_obs,i]^2 ).   (10)

For a fixed estimated frequency m0, the Gaussian observation errors cause the misfit distribution ρ(χ|m0) to be a non-central chi-square distribution (Abramowitz & Stegun 1972) with an estimated non-centrality parameter λ ≈ max(χ0 − N_d, 0) (e.g. Saxena & Alam 1982). Therefore, a plausible distribution of the tolerance ε expresses the probability of obtaining a misfit χ that exceeds χ0. As shown in Appendix A1, this distribution is given by χ(ε|m0) = ρ(χ0 + ε|m0) + kδ(ε), with a constant k.

Successively drawing random tolerances ε_i from χ(ε|m_i) provides initial momenta p_i = ±√(2Mε_i) for Hamiltonian nullspace exploration. The sign can be chosen arbitrarily because momentum space is symmetric in p. The same is true for the mass M, which is balanced against p_i to always yield the same initial kinetic energy, according to eq. (6). In this example, we choose the plus sign and M = 1. Since Hamilton's equations for this case cannot be solved analytically, we rely on a numerical approximation, which we compute using the leapfrog method (Appendix C). A collection of Hamiltonian trajectories for tolerances drawn from χ(ε|m_i) is shown in Fig. 2(b). Each trajectory traces an iso-line of the total energy H(p, m), thereby reaching alternative models with misfit below χ_i + ε_i.

Following the sampling procedure described in Section 2.3 ensures that the trajectory end points sample the generalized nullspace distribution ρ(p, m), displayed in Fig. 2(c). As intuitively expected, smaller misfits (larger probabilities for some ε = const.) admit larger tolerances, and vice versa.

2.4.2 Exploring a high-dimensional Gaussian

In the important class of linear inverse problems with normally distributed observational errors, the misfit χ(m) takes the form

χ(m) = (1/2) m^T C^{-1} m,   (11)

with some covariance matrix C (e.g. Parker 1994; Tarantola 2005; Menke 2012). While still being a simplistic case, it provides useful insight into the mechanics of Hamiltonian nullspace sampling, especially when the eigenvalues of C differ by several orders of magnitude.

Here, we consider a 1000-D model space and a diagonal covariance matrix with entries ranging linearly from C_{1,1} = 0.01 to C_{1000,1000} = 1.0. Furthermore, to make the explicit link to HMC, we draw directional tolerances p from an N_m-dimensional Gaussian with covariance M = I, meaning that the mass matrix M equals the unit matrix. It follows that the generalized nullspace sampling introduced in Section 2.3 produces samples of the joint distribution

ρ(p, m) ∝ e^{−(1/2) p^T p} e^{−(1/2) m^T C^{-1} m}.   (12)

Fig. 3 summarizes the result after drawing 3000 samples, of which the first 1000 are ignored as burn-in. As in the previous example, we solve Hamilton's equations using the leapfrog algorithm (Appendix C). While the approximated 1-D marginal of parameter m1 in Fig. 3(a) resembles the desired Gaussian with standard deviation 0.1, the 1-D marginal of m1000 appears bimodal instead of Gaussian, indicating that the number of samples is insufficient. The seemingly different convergence speeds can be explained with the sample autocorrelations of the two components,

c_i(k) = ( Σ_{l=1}^{N} m_{l,i} m_{l+k,i} ) / ( Σ_{l=1}^{N} m_{l,i} m_{l,i} ),   (13)
Figure 3. Summary of HMC sampling of the 2×1000-D Gaussian in eq. (12). The model covariance matrix C is diagonal, with elements ranging linearly from C_{1,1} = 0.01 to C_{1000,1000} = 1.0. The mass matrix M equals the unit matrix I. Of the 3000 samples used, 1000 are ignored as burn-in. (a, b) 1-D marginals for parameters m1 and m1000. (c) Autocorrelations averaged over 100 HMC runs of the sample chains for m1 and m1000, with corresponding effective sample fractions. (d) 2-D projection of a representative Hamiltonian trajectory (red, starting point in blue), with the target Gaussian shown in greyscale in the background.

where N is the number of samples (excluding burn-in). The sample autocorrelations in Fig. 3(c) reveal that the m1-components of successive samples are practically uncorrelated. In contrast, the m1000-components are correlated over hundreds of samples, suggesting that model space exploration in m1000-direction is vastly less efficient. In addition to being a qualitative measure of sample dependence, the autocorrelation plays an important role in error estimates of Monte Carlo integration (e.g. Geyer 1992) and permits estimates of the effective sample size (e.g. Ripley 1987; Kass et al. 1998; Gelman et al. 2013),

N_eff = N / ( 1 + 2 Σ_{k=1}^{∞} c_i(k) ).   (14)

For uncorrelated samples, the Monte Carlo integration error is proportional to 1/√N, but only proportional to 1/√N_eff when samples are correlated. Therefore, the effective sample fraction N_eff/N serves as an exchange rate that accounts for sample correlation. In practice, the infinite sum in eq. (14) must be approximated by a truncated version, because the available number of samples is finite, and because c_i(k) has a long noisy tail (Bartlett 1946). We follow Gelman et al. (2013) in terminating the summation when the sum of two successive autocorrelation values becomes negative for the first time. Applied to our example, we obtain effective sample fractions of 0.3286 for m1 and 0.0047 for m1000, meaning that only one in around 1/0.0047 ≈ 213 samples is statistically independent in m1000-direction. Though numerous other definitions and implementations of the effective sample size have been proposed (e.g. Kong 1992; Martino et al. 2016), we will adhere to the version introduced above, as it can be easily computed and interpreted.

The differences in effective sample fractions for m1 and m1000 can be understood by examining a typical Hamiltonian trajectory, shown in Fig. 3(d). In m1-direction, the artificial particle makes rapid progress, exploring different parts of model space. In contrast, progress in m1000-direction is comparatively slow, meaning that all models along the trajectory have strongly correlated m1000-components. The trajectory in Fig. 3(d) also suggests a solution to the problem of widely varying convergence speed, at least in the case of the simple quadratic misfit (11). In fact, changing the mass matrix from M = I to M = C^{-1} causes the trajectories to oscillate equally fast in all directions (Fichtner et al. 2019). As a consequence, the 1-D marginals for m1 and m1000, shown in Figs 4(a) and (b), are both approximately Gaussian, with variances of around 0.01 and 1.0, respectively. Furthermore, instead of being correlated over hundreds of samples, both effective sample fractions are around 0.15, meaning that one in around six samples is statistically independent. These properties are reflected in the 2-D projection of a representative trajectory in Fig. 4(d), which makes similarly fast progress in m1- and m1000-direction.

2.5 Problem statements and outlook

In the previous sections, we introduced a framework for the explicit computation of alternative models with misfit below a defined threshold, and the sampling of the generalized nullspace distribution ρ(p, m). While being conceptually straightforward, the main difficulty lies in the selection of tuning parameters that ensure the efficient computation of models that are independent. These tuning parameters include the mass matrix M, the integration time step Δτ, and the total length of the trajectory T. Each of these tuning parameters comes with its own subproblem, explained in the following paragraphs.

2.5.1 The mass matrix and local Hessian approximations

Section 2.4.2 suggests that the mass matrix M should approximate the local Hessian H(m) of χ(m). When χ(m) is quadratic, as in eq. (11), we simply have H = C^{-1}. In the majority of applications, however, H(m) is a priori unknown, and it can neither be computed nor stored explicitly. Furthermore, a factorization H = SS^T, needed to draw samples of the directional tolerance p, as described in Section 2.2, is usually unavailable.

We address the local approximation of the Hessian in Section 3 with the formulation of a factorized version of the L-BFGS method, known from non-linear optimization (Nocedal 1980; Liu & Nocedal 1989; Nocedal & Wright 1999), and recently applied to geophysical inverse problems (e.g. Prieux et al. 2013; Métivier & Brossier 2016; Modrak & Tromp 2016; Thrastarson et al. 2020; van Herwaarden et al. 2020).

Figure 4. Summary of HMC sampling with a setup identical to the one in Fig. 3, except for choosing the mass matrix M = C^{-1}. (a, b) 1-D marginals for parameters m1 and m1000. (c) Autocorrelations averaged over 100 HMC runs of the sample chains for m1 and m1000, with corresponding effective sample fractions. (d) 2-D projection of a representative Hamiltonian trajectory (red, starting point in blue), with the target Gaussian shown in greyscale in the background.
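Returning to eqs (13) and (14): the effective sample fraction with the truncation rule of Gelman et al. (2013) amounts to only a few lines of code. The sketch below (Python/NumPy) is one straightforward reading of that rule, not the authors' implementation:

import numpy as np

def effective_sample_fraction(x):
    # x: 1-D array holding the chain of one model-space component,
    # with burn-in already removed.
    N = len(x)
    denom = np.sum(x * x)
    # Sample autocorrelation c(k), eq. (13); the numerator is truncated
    # at N - k terms because later samples do not exist.
    c = np.array([np.sum(x[:N - k] * x[k:]) / denom for k in range(N // 2)])
    # Truncate the infinite sum in eq. (14) where the sum of two successive
    # autocorrelation values first becomes non-positive.
    s, k = 0.0, 1
    while k + 1 < len(c):
        pair = c[k] + c[k + 1]
        if pair <= 0.0:
            break
        s += pair
        k += 2
    return 1.0 / (1.0 + 2.0 * s)

Applied column-wise to a chain of samples, this returns fractions N_eff/N of the kind quoted above for m1 and m1000.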

2.5.2 Integration length, Poincaré recurrence and energy drift

To some extent, the strong correlation of successive samples, illustrated in Section 2.4.2, could be overcome by computing longer Hamiltonian trajectories, that is, by increasing the integration length T. However, in addition to being computationally expensive, this approach is inherently limited by Poincaré recurrence (e.g. Poincaré 1890; Landau & Lifshitz 1976; Stephani & Kluge 1995). The arbitrarily close return of a trajectory to a previously visited point after some sufficiently long time will introduce correlations that we actually wanted to avoid.

An equally profound complication is energy drift, that is, imperfect energy conservation of numerical integrators for Hamilton's equations (e.g. Toxvaerd 1994; Toxvaerd et al. 2012). As shown in Appendix C2, energy conservation of the leapfrog method, and of the nearly identical Verlet integrator (Verlet 1967), is only correct to first order in the integration time step Δτ. Though this may be improved with high-order integrators (Yoshida 1990; Martyna & Tuckerman 1995), exact energy conservation requires implicit integration schemes (Simo et al. 1992; Quispel & McLaren 2008), which are computationally out of scale for high-dimensional inverse problems.

In the context of generalized nullspace sampling, energy drift has two main effects: (1) the misfit of models along a Hamiltonian trajectory may not actually be below the defined tolerance and (2) the acceptance rate of the random sampling introduced in Section 2.3 may drop substantially because the acceptance criterion involves the ratio between the initial and the final Hamiltonian.

In the context of standard HMC, where the directional tolerance distribution is Gaussian, the integration length T has received considerable attention (e.g. Mackenzie 1989; Hoffman & Gelman 2014). Based on the local Hessian approximation, we present in Section 3.4.2 a semi-analytical argument for the suitable choice of the integration length, which empirically works well in numerical experiments.

2.5.3 Numerical stability and adaptive time stepping

The numerical stability of leapfrog, and of any other explicit integrator, depends on the eigenvalues of the mass matrix M (see, for instance, Appendix C1). Hence, as M changes during the generalized nullspace sampling, the integrator may become unstable. To prevent such instability, we propose in Section 3.4.3 an adaptive time stepping scheme, where the integration time step can be adjusted. It rests entirely on estimates of energy conservation, thereby avoiding the need to compute eigenvalues of M.
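The scheme itself is developed in Section 3.4.3. Purely to illustrate what energy-based step control means, a hypothetical rule (an assumption made for illustration here, not the rule derived in Section 3.4.3) could read:

def adapt_time_step(dt, H_start, H_end, tol=1e-2, shrink=0.5, grow=1.05):
    # Hypothetical: shrink the leapfrog step when the relative energy drift
    # along a trajectory exceeds tol, otherwise let it grow slowly.
    drift = abs(H_end - H_start) / max(abs(H_start), 1e-12)
    return dt * shrink if drift > tol else dt * grow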
2.5.3 Numerical stability and adaptive time stepping                            Sections 2.2 and 2.3, requires a factorization of the mass matrix,
                                                                                M = SST , for instance, a Cholesky decomposition. However, the
The numerical stability of leapfrog, and of any other explicit inte-
                                                                                number of operations needed to compute S is of order Nm3 , meaning
grator, depends on the eigenvalues of the mass matrix M (see, for
                                                                                that it is out of scale for many relevant applications. (2) The matrices
instance, Appendix C1). Hence, as M changes during the gen-
                                                                                H−1
                                                                                  k may be too large to be stored.
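For concreteness, a minimal NumPy sketch of this standard update may read as follows. The sketch is ours, not part of the algorithm development above; it deliberately stores the full N_m × N_m matrix and thereby makes the two issues just mentioned explicit.

    import numpy as np

    def bfgs_update(H_inv, m_old, m_new, grad_old, grad_new):
        # Auxiliary vectors of eq. (15).
        s = m_new - m_old
        y = grad_new - grad_old
        # Scaling factor of eq. (17); it must be strictly positive to
        # preserve positive definiteness, so we skip the update otherwise.
        ys = y @ s
        if ys <= 0.0:
            return H_inv
        rho = 1.0 / ys
        I = np.eye(s.size)
        V = I - rho * np.outer(s, y)
        # Inverse-Hessian update of eq. (16). Storage is O(Nm^2), and a
        # factorization M = S S^T of the resulting mass matrix costs O(Nm^3).
        return V @ H_inv @ V.T + rho * np.outer(s, s)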
3.2 Factorized BFGS updating (F-BFGS)

To produce a scalable algorithm, we aim to compute the matrix factor S directly, using a modified BFGS update equation.
Following an approach indicated by Brodlie et al. (1973), we first write the regular BFGS update from eq. (16) in the factorized form

H_{k+1}^{-1} = (I + u_k v_k^T) H_k^{-1} (I + u_k v_k^T)^T,  (18)

with two vectors u_k and v_k that remain to be determined. First, we note that eq. (16) may be expanded to

H_{k+1}^{-1} = H_k^{-1} + a_k a_k^T − b_k a_k^T − a_k b_k^T,  (19)

where we defined the auxiliary variables

γ_k^2 = ρ_k^2 y_k^T H_k^{-1} y_k + ρ_k,  (20a)

a_k = γ_k s_k,  (20b)

b_k = (ρ_k/γ_k) H_k^{-1} y_k.  (20c)

Comparing the expanded forms of (18) and (19) motivates the following ansatz for the vectors u_k and v_k:

u_k = a_k,  (21a)

v_k = −H_k (b_k + θ a_k),  (21b)

with some scalar θ. To find θ, we substitute eqs (21a) and (21b) into eq. (18),

H_{k+1}^{-1} = H_k^{-1} + a_k a_k^T [(b_k + θ a_k)^T H_k (b_k + θ a_k) − 2θ] − b_k a_k^T − a_k b_k^T.  (22)

The comparison of eq. (22) to eq. (19) shows that θ must satisfy the quadratic equation

(b_k + θ a_k)^T H_k (b_k + θ a_k) − 2θ = 1,  (23)

or, slightly reordered,

(a_k^T H_k a_k) θ^2 + 2 (a_k^T H_k b_k − 1) θ + b_k^T H_k b_k − 1 = 0.  (24)

To express the polynomial coefficients in eq. (24) in terms of s_k and y_k, we re-substitute eqs (20a) and (20b), which leads to

(γ_k^2 s_k^T H_k s_k) θ^2 = ρ_k/γ_k^2.  (25)

Eq. (25) yields two real-valued solutions for θ provided that ρ_k > 0, which is identical to the condition needed to ensure positive-definite BFGS updates (e.g. Nocedal & Wright 1999). The previous set of equations provides a simple recipe for the factorized BFGS (F-BFGS) updating of H_k^{-1}, based on the computation of the vectors u_k and v_k through eqs (20) and (21). A factorized update of H_k now follows directly from the inversion of eq. (18),

H_{k+1} = (I + v_k u_k^T)^{-1} H_k (I + u_k v_k^T)^{-1},  (26)

combined with the Sherman–Morrison formula for the inverse of rank-one updates (Bartlett 1951; Nocedal & Wright 1999),

(I + v_k u_k^T)^{-1} = I − v_k u_k^T / (1 + v_k^T u_k).  (27)

Assuming that a factorization H_k = S_k S_k^T is available from previous F-BFGS iterations, eqs (26) and (27) imply that the updated matrix factor S_{k+1} and its inverse S_{k+1}^{-1} are given by

S_{k+1} = [I − v_k u_k^T / (1 + v_k^T u_k)] S_k,   S_{k+1}^{-1} = S_k^{-1} (I + v_k u_k^T).  (28)

Knowing the matrix factors S_{k+1} and S_{k+1}^{-1} allows us to compute all (inverse) Hessian-vector products and to generate random momenta from a Gaussian with covariance M = H_k. We parenthetically remark that S_k is usually dense and not a Cholesky factor of H_k.
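For illustration, this recipe can be condensed into a few lines of NumPy. The sketch below is our own; it forms H_k densely purely for readability, which a scalable implementation must avoid, and it selects the positive root of eq. (25).

    import numpy as np

    def f_bfgs_update(S, s, y):
        """One F-BFGS update of the factor S, with H = S S^T (dense for clarity)."""
        H = S @ S.T
        H_inv_y = np.linalg.solve(H, y)          # H_k^{-1} y_k
        rho = 1.0 / (y @ s)                      # eq. (17); assumed > 0
        gamma2 = rho**2 * (y @ H_inv_y) + rho    # eq. (20a)
        gamma = np.sqrt(gamma2)
        a = gamma * s                            # eq. (20b)
        b = (rho / gamma) * H_inv_y              # eq. (20c)
        # Positive root of the quadratic condition, eq. (25).
        theta = np.sqrt(rho / (gamma2**2 * (s @ (H @ s))))
        u = a                                    # eq. (21a)
        v = -H @ (b + theta * a)                 # eq. (21b)
        # Factor update of eq. (28).
        S_new = S - np.outer(v, u @ S) / (1.0 + v @ u)
        return S_new, u, v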
3.3 Limited-memory factorized BFGS updating (LF-BFGS)

The factorized updating formulae (28) straightforwardly enable a limited-memory approach, similar to the standard limited-memory BFGS concept of Nocedal (1980) and Liu & Nocedal (1989). In fact, letting h be some arbitrary vector, we may write

S_{k+1} h = [I − v_k u_k^T/(1 + v_k^T u_k)] [I − v_{k-1} u_{k-1}^T/(1 + v_{k-1}^T u_{k-1})] · · · [I − v_0 u_0^T/(1 + v_0^T u_0)] S_0 h.  (29)

Typically, the initial matrix S_0 equals the unit matrix I. Defining h_0 = S_0 h, eq. (29) takes the form of a sequential update,

h_{i+1} = [I − v_i u_i^T/(1 + v_i^T u_i)] h_i = h_i − v_i (u_i^T h_i)/(1 + v_i^T u_i),   i = 0, ..., k,  (30)

which eventually gives

S_{k+1} h = h_{k+1}.  (31)

Most importantly, eq. (30) only contains vector–vector products, eliminating the need to explicitly compute and store any matrices. Furthermore, the sequence can be limited to the last ℓ + 1 vector pairs (u_k, v_k), ..., (u_{k−ℓ}, v_{k−ℓ}), which further reduces storage requirements at the expense of a less accurate but hopefully still acceptable Hessian approximation (Nocedal & Wright 1999). Following this approach, similar equations can be found for products of h with the inverse and transpose of S_{k+1}.
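As an illustration, a matrix-free implementation of eqs (29)-(31) may be sketched as follows, assuming a diagonal S_0 stored as a vector; the function name and calling convention are ours.

    import numpy as np

    def apply_factor(h, pairs, S0_diag):
        """Compute S_{k+1} h from the stored vector pairs (u_i, v_i).

        `pairs` holds the (possibly truncated) sequence
        (u_0, v_0), ..., (u_k, v_k), oldest pair first.
        """
        hi = S0_diag * h                              # h_0 = S_0 h
        for u, v in pairs:                            # i = 0, ..., k
            hi = hi - v * ((u @ hi) / (1.0 + v @ u))  # eq. (30)
        return hi                                     # = S_{k+1} h, eq. (31)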
3.4 Further measures to improve convergence and stability

3.4.1 Iterative updating of the initial matrix

Despite being, for simplicity, presented as a constant in Section 3.3, the initial matrix factor S_0 can and should be updated to improve convergence. A constant S_0 implies that the LF-BFGS algorithm has a memory of ℓ samples, which may be small compared to the model space dimension N_m. Updating S_0 may increase the memory, meaning that more than ℓ samples effectively contribute to the Hessian approximation.

Most straightforwardly, S_0 is replaced at regular intervals, typically every ℓ samples, by the square root of the diagonal elements of the current Hessian approximation, that is,

√(diag H_k) → S_0.  (32)

We note that any updating of S_0 requires a recalculation of the vector sequences u_0, u_1, ... and v_0, v_1, ..., as they depend on S_0.

3.4.2 Integration length

The choice of a suitable integration length T is a balancing act between a large T that ensures rapid model space exploration and a small T to limit computational cost. Fortunately, using the LF-BFGS Hessian H_k as mass matrix provides some useful guidance. In fact, as H_k, and therefore M, approach the true Hessian H, the Hamiltonian trajectories converge towards segments of N_m-dimensional circles that are traversed with period 2π. Hence, in the case of a roughly constant Hessian, we observe approximate Poincaré recurrence for T = 2π, and about half the trajectory has been traversed for T = π.
When the Hessian is not approximately constant, the above argument loses precision. Nevertheless, setting T ≈ π, with some random variations to avoid cyclic behaviour of the sampler (Mackenzie 1989), is an empirically useful choice that we adopted in all of the following examples.
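This recurrence argument is easily verified numerically. The following self-contained toy experiment (our own check, with an arbitrary diagonal Hessian) integrates Hamilton's equations with M = H by leapfrog and confirms that the trajectory nearly closes after T = 2π.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 10
    H = np.diag(rng.uniform(1.0, 100.0, size=n))  # toy Hessian, U(m) = 0.5 m^T H m
    M_inv = np.linalg.inv(H)                      # mass matrix M = H

    m = rng.standard_normal(n)
    p = np.linalg.cholesky(H) @ rng.standard_normal(n)  # p ~ N(0, M)
    m_start = m.copy()

    tau = 0.001
    for _ in range(int(round(2.0 * np.pi / tau))):  # integrate to T = 2*pi
        p -= 0.5 * tau * (H @ m)                    # half kick
        m += tau * (M_inv @ p)                      # drift
        p -= 0.5 * tau * (H @ m)                    # half kick

    # Near-recurrence: this relative distance is small and shrinks with tau.
    print(np.linalg.norm(m - m_start) / np.linalg.norm(m_start))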

3.4.3 Initial and adaptive time stepping

The (leapfrog) integration time step τ is controlled by the need to (1) conserve energy of the nullspace shuttle, (2) maintain high acceptance rates of the nullspace sampler and (3) ensure numerical stability. As demonstrated in Appendix C1, numerical stability requires τ ≤ 2/√λ_max(M^{-1}H), where λ_max(M^{-1}H) is the maximum eigenvalue of the matrix product M^{-1}H.

An initial estimate of a conservative τ may be obtained by simple trial-and-error, that is, the testing of candidate time steps until integration is stable. Alternatively, one may have estimates of the maximum eigenvalue of the Hessian near the estimated model, λ_max[H(m_0)], from physical arguments or the application of second-order adjoint methods (e.g. Santosa & Symes 1988; Pratt et al. 1998; Fichtner & Trampert 2011). Setting the initial mass matrix to λ_max[H(m_0)] I then causes the maximum allowable τ to be around 2, meaning that τ ≈ 1 is likely to be a useful and conservative starting point.

Successive updating of the mass matrix with the LF-BFGS Hessian affects numerical stability because the maximum eigenvalue of M^{-1}H changes. Since repeated eigenvalue estimations are computationally out of scale, an alternative approach for the adjustment of τ is needed. For this, we may exploit the otherwise undesirable fact that energy conservation of the leapfrog scheme is only correct to first order in τ, as shown in Appendix C2. The deterioration of energy conservation may therefore be used as a proxy for upcoming numerical instability.

In practice, the adaptation of τ is most easily implemented by monitoring the acceptance rate R averaged over roughly ℓ samples. A decrease of R below some threshold R_min relates directly to deteriorating energy conservation, suggesting that τ should be reduced to a smaller value γτ with γ < 1. Conversely, when R is above some threshold R_max, the time step may be increased to τ/γ to reduce computational costs. In the following examples, we use R_min = 0.65, R_max = 0.85 and γ = 0.80, as empirically useful values.
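In code, this adaptation rule takes only a few lines; the sketch below (naming ours) uses the threshold values quoted above.

    def adapt_time_step(tau, R, R_min=0.65, R_max=0.85, gamma=0.80):
        """Adjust the leapfrog time step based on the acceptance rate R,
        averaged over the last few samples."""
        if R < R_min:
            return gamma * tau   # energy conservation deteriorates: reduce step
        if R > R_max:
            return tau / gamma   # comfortably stable: increase step to save cost
        return tau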
3.5 Loss of the Markov property

The sequence of generalized nullspace samples produced by the autotuning algorithm is not an exact Markov chain, where the next model only depends on the current one. In fact, the next model depends on the current mass matrix, which is controlled by ℓ > 1 previous misfit and gradient evaluations. Hence, the stochastic sampling process is not memoryless, as required by the detailed balance proof in Appendix B. Similar to other approximate Markov chain methods (e.g. Bardenet et al. 2014; Fox & Nicholls 1997; Korattikara et al. 2014; Scott et al. 2016), the autotuning algorithm may effectively sample a different distribution, thereby introducing bias.

In realistic applications, where the nullspace distribution ρ(p, m) is unknown from the outset, the bias may be difficult to estimate. Nevertheless, the autotuning algorithm may still produce independent nullspace samples more efficiently than a Hamiltonian sampler with unit mass matrix.

In the following section, we introduce two variations of the autotuning approach that preserve the Markov property, possibly at the expense of reduced efficiency (depending on the specifics of an application).

3.6 Variations of the theme

The set of algorithms presented in Sections 3.1–3.4 provides a general autotuning framework. As suggested by the No-Free-Lunch theorem (e.g. Wolpert & Macready 1997; Mosegaard 2012), its efficiency may be increased through slight adaptations that account for prior knowledge. Two possible adaptations that we will revisit in later numerical examples are presented in the following paragraphs.

3.6.1 Diagonal freezing

In cases where the Hessian is a priori known to be roughly diagonal and roughly invariant, the generalized nullspace sampling may be accelerated by estimating the diagonal of the Hessian using a few very short sample chains starting from different initial models. The resulting approximation of the Hessian diagonal is then used as a constant mass matrix in a sample chain that is sufficiently long to ensure convergence.

The freezing of the diagonal after its initial estimation has the advantage of avoiding both the computational cost of on-the-fly autotuning and the potential bias introduced by an otherwise inexact Markov chain (see Section 3.5). These advantages have to be balanced on a case-by-case basis against the disadvantage of ignoring off-diagonal elements and a non-constant Hessian. An example of the diagonal freezing approach is presented in Section 4.2.
3.6.2 Macroscopic autotuning

When the generalized nullspace has fine-scale structure, for instance, in the form of numerous local minima superimposed on some broad-scale background that is roughly Gaussian, we may borrow basic ideas from tempering (e.g. Kirkpatrick et al. 1983; Marinari & Parisi 1992; Geyer & Thompson 1995; Sambridge 2014). Instead of considering the original generalized nullspace distribution,

ρ(p, m) = e^{−H(p,m)} = e^{−U(m)} e^{−(1/2) p^T M^{-1} p},  (33)

we consider a tempered version,

ρ^{1/T}(p, m) = e^{−H(p,m)/T} = e^{−U(m)/T} e^{−(1/2) p^T M_T^{-1} p},  (34)

with a temperature T > 1 and a tempered or macroscopic mass matrix

M_T^{-1} = (1/T) M^{-1}.  (35)

By design, tempering suppresses detail while enhancing and broadening macroscopic features of the distribution, as schematically illustrated in Fig. 5. The macroscopic shape of the generalized nullspace may be captured using an LF-BFGS approximation of the macroscopic Hessian, again using a small number of very short chains starting from different initial models. Subsequently, the macroscopic Hessian in LF-BFGS representation can be scaled back to a hopefully useful and constant mass matrix of the actual problem using eq. (35).
Figure 5. Schematic illustration of the effect of tempering, which transforms the multimodal distribution in (a) into the smoother, more Gaussian-like distribution in (b).

The advantages and drawbacks of macroscopic autotuning are similar to those of diagonal freezing in Section 3.6.1. Furthermore, by virtue of eq. (34), macroscopic autotuning is limited to cases where the tolerance distribution is Gaussian. An example of the macroscopic autotuning approach can be found in Section 4.3.
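In factorized form, the scale-back of eq. (35) reduces to a single multiplication. A minimal sketch (our own), assuming that the macroscopic Hessian was estimated as M_T = S_T S_T^T:

    import numpy as np

    def untempered_factor(S_T, T):
        """Scale a macroscopic (tempered) mass-matrix factor back to the
        actual problem. Eq. (35) gives M^{-1} = T * M_T^{-1}, i.e. M = M_T / T,
        so the factor transforms as S = S_T / sqrt(T)."""
        return S_T / np.sqrt(T)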
                                                                                  Styblinski–Tang function can take negative values, we use a mod-
                                                                                  ified version, described in Appendix D1, in order to define a mis-
                                                                                  fit χ (m). Furthermore, we introduce interparameter trade-offs. To
                                                                                  again make the connection to HMC, we use an Nm -dimensional
4 P E R F O R M A N C E A N A LY S I S U S I N G
                                                                                  Gaussian for the proposal of directional tolerances p. In this exam-
A N A LY T I C A L T E S T F U N C T I O N S
                                                                                  ple, we again choose Nm = 1000.
The following paragraphs are dedicated to a performance analysis                      As a reference, we consider a chain of 1 million samples com-
of the autotuning approach proposed in Section 3. The focus will                  puted with constant unit mass matrix, M = I, and constant time
be on the two main goals of this work: (1) the efficient computation              step, τ . By laborious trial and error, we determined τ and the
of independent alternative models and (2) the efficient sampling of               integration length T such that the effective sample fraction of the
the nullspace distribution ρ(p, m). While the former can be easily                least constrained parameter, m1000 in this case, is maximized. Specif-
quantified in terms of the effective sample size, defined in (14), the            ically, we found τ = 0.35 and T = 2.45 to produce a maximum
latter is more difficult because there is no universally valid quantifier         effective sample fraction of 1.1 × 10−4 , as illustrated in Fig. 7(a).
of Markov chain convergence, though numerous proxies have been                    Small changes of τ and T may increase the effective sample frac-
proposed (e.g. Gelman & Rubin 1992; Geweke 1992; Raftery &                        tion slightly, but order-of-magnitude improvements are unlikely to
Lewis 1992; Cowles & Carlin 1996; Roy 2019).                                      be possible. The small value of the effective sample fraction mostly
   To quantify convergence, we conduct the performance anal-                      reflects the number of samples required to switch between different
ysis using analytical test functions for which lower-dimensional                  modes of the modified Styblinski–Tang function. This is in contrast
marginals and moments of various orders can be computed exactly.                  to the Gaussian, where the effective sample fraction describes the
Unavoidably, this widely-used approach to performance analysis is                 (in-)dependence of samples within the only existing mode.
limited by the small number of test functions that we can consider.                   Using the diagonal freezing variant of autotuning from Sec-
Nevertheless, it provides indications about the circumstances under               tion 3.6.1, we then compute a constant diagonal mass matrix by
which the proposed algorithms are useful.                                         averaging the diagonals of LF-BFGS Hessian approximations ob-
   In all of the following examples, including those in Section 5,                tained from 10 sample chains. Each of these chains starts from a
we use on average five leapfrog integration steps, meaning that the               different, randomly selected initial model and only contains 200
number of misfit and gradient evaluations is around five times larger             samples. The resulting effective sample fraction is 5.5 × 10−4 , that
than the number of samples.                                                       is, 5 times larger than in the most optimal case with unit mass matrix.
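As a side note, an effective sample fraction can be estimated from the integrated autocorrelation time of a chain. The generic estimator sketched below (our own, truncating the sum at the first negative autocorrelation) may differ in detail from the definition in eq. (14).

    import numpy as np

    def effective_sample_fraction(x):
        """Estimate N_eff / N from the autocorrelation of a 1-D chain."""
        x = np.asarray(x, dtype=float) - np.mean(x)
        acf = np.correlate(x, x, mode="full")[x.size - 1:]
        acf = acf / acf[0]
        tau_int = 1.0
        for r in acf[1:]:
            if r < 0.0:
                break
            tau_int += 2.0 * r          # integrated autocorrelation time
        return 1.0 / tau_int            # N_eff / N = 1 / tau_int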
                                                                                  Hence, the sampler manages to switch into a different mode of the
                                                                                  distribution about every 2000 samples, instead of 10 000 samples
                                                                                  in the case without autotuning. The differences in effective sample
4.1 Return to the high-dimensional Gaussian                                       fractions translate to differences in convergence. Since statistical
                                                                                  moments are either hard to interpret for a multimodal distribution
Starting with the simplest possible case, we return to the sampling of            (e.g. means and variances) or highly susceptible to outliers (higher
the 1000-D Gaussian, previously presented as motivating example                   moments such as skewness or kurtosis), we consider convergence
in Section 2.4.2. Updating the initial mass matrix M = I with the au-             to the exact 2-D marginal of m1 and m1000 , which we can compute
totuning procedure described in Section 3, reproduces Fig. 4 almost               semi-analytically. For this, we measure the discrepancy between the
exactly. Hence, we achieve effective sample sizes as if we had used               exact marginal ρ(m1 , m1000 ) and the sample-approximated marginal
M = H = C−1 from the outset. The time-step adaptivity guided by                   ρ̃(m 1 , m 1000 ) in terms of the Kullback–Leibler divergence or rela-
the average acceptance rate ensures that the leapfrog integration                 tive information content (e.g. Shannon 1948; Kullback & Leibler
remains numerically stable. This is summarized in Fig. 6. Since the               1951; Tarantola & Valette 1982; Tarantola 2005),
target distribution is Gaussian, the LF-BFGS approximation to the
Hessian eventually becomes stationary in this example, meaning
that the initially approximate Markov chain converges towards an
                                                                                                                       ρ̃(m 1 , m 1000 )
exact Markov chain.                                                               DK L =     ρ̃(m 1 , m 1000 ) log10                     dm 1 dm 1000 .   (36)
                                                                                                                       ρ(m 1 , m 1000 )
Figure 6. Time-step adaptivity during autotuning of the nullspace sampler. (a) Acceptance rate R averaged over the previous 20 samples. (b) Variable integration time step τ that aims to keep R between the threshold values R_max = 0.85 and R_min = 0.5.
Figure 7. Autocorrelations of the most constrained parameter, m_1, and the least constrained parameter, m_1000, of the modified Styblinski–Tang function, averaged over 10 realizations of sample chains with 1 million samples each. (a) Without autotuning, autocorrelation lengths are on the order of 10 000, meaning that around 10 000 samples are needed to switch between modes of the modified Styblinski–Tang function (eq. D2). The corresponding effective sample fractions, N_eff/N, are on the order of 1 × 10^{-4}. (b) Autocorrelations and effective sample fractions when the diagonal freezing variant of autotuning is used. The effective sample fractions increased by a factor of about 5.

In this context, D_KL can be interpreted as a loss of information (in digits) that results from an inaccurate approximation of the exact distribution. As illustrated in Fig. 8, the autotuning variant of the sampler approximates the exact marginal with on the order of 10 000 samples, assuming D_KL = 0.1 as a reasonable threshold. Around five times more samples are needed without autotuning. We note that other measures of convergence are, of course, possible, but they are unlikely to change the general conclusion, given that the effect of autotuning is not small.
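A histogram-based approximation of eq. (36) may be sketched as follows; the function is our own illustration, and exact_marginal stands for the semi-analytically computed density ρ(m_1, m_1000).

    import numpy as np

    def kl_divergence(samples_a, samples_b, exact_marginal, bins=50):
        """Approximate the D_KL of eq. (36) between the sample-approximated
        2-D marginal (a normalized histogram) and the exact marginal density."""
        counts, ea, eb = np.histogram2d(samples_a, samples_b,
                                        bins=bins, density=True)
        ca = 0.5 * (ea[:-1] + ea[1:])             # bin centres
        cb = 0.5 * (eb[:-1] + eb[1:])
        dA = (ea[1] - ea[0]) * (eb[1] - eb[0])    # cell area
        A, B = np.meshgrid(ca, cb, indexing="ij")
        rho_exact = exact_marginal(A, B)
        mask = counts > 0.0                       # empty bins contribute zero
        return float(np.sum(counts[mask]
                            * np.log10(counts[mask] / rho_exact[mask])) * dA)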
4.3 Modified Rastrigin function

Similar to the Styblinski–Tang function, the 2-D version of the Rastrigin function was initially proposed as a performance test function for optimization algorithms (Rastrigin 1974). Its higher-dimensional generalization, proposed by Rudolph (1990), is given by eq. (D5) in Appendix D2. Being highly oscillatory, the Rastrigin function is non-convex and equipped with an infinite number of local maxima. Since the Rastrigin function is positive semi-definite, it can be used directly as a misfit function. Yet, to mimic geophysical inverse problems more closely, we introduce inter-parameter correlations and variable parameter sensitivities, as we previously did for the Styblinski–Tang function. The resulting modified Rastrigin function is defined through eq. (D6), and some illustrations of the function itself and its associated probability density are presented in Fig. D2. For the model space dimension, we again choose N_m = 1000.

To establish a reference, we disable autotuning and repeat the trial-and-error search over the integration time step, τ, and the integration length, T, with the aim to maximize the effective sample fraction of the least constrained parameter, m_1000. Nearly optimal values for chains with 1 million samples are τ = 0.02 and T = 1.4, leading to low effective sample fractions of 7.0 × 10^{-6} for m_1 and 7.1 × 10^{-6} for m_1000. The corresponding autocorrelation graphs are shown in Fig. 9(a). As for the modified Styblinski–Tang function, the effective sample fractions mostly reflect the average number of samples needed for the transition between different modes of the multimodal probability density.

To improve convergence, we use the macroscopic autotuning approach presented in Section 3.6.2, using the temperature T = 100 and only 500 samples. The resulting LF-BFGS representation of the mass matrix M_T is then rescaled and kept constant during the subsequent sampling of the modified Rastrigin function. The resulting autocorrelation graphs are shown in Fig. 9(b). Relative to the previous chain without autotuning, effective sample fractions increase by a factor of around 50, to 3.6 × 10^{-4} for m_1 and 3.0 × 10^{-4} for m_1000.

The large differences of effective sample fractions translate to differences in convergence towards the posterior distribution. As an example, we again consider the Kullback–Leibler divergence of the