High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces


David Eriksson*,1    Martin Jankowiak*,2

1 Facebook, Menlo Park, California, USA
2 Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA

* Equal contribution

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

Abstract

Bayesian optimization (BO) is a powerful paradigm for efficient optimization of black-box objective functions. High-dimensional BO presents a particular challenge, in part because the curse of dimensionality makes it difficult to define—as well as do inference over—a suitable class of surrogate models. We argue that Gaussian process surrogate models defined on sparse axis-aligned subspaces offer an attractive compromise between flexibility and parsimony. We demonstrate that our approach, which relies on Hamiltonian Monte Carlo for inference, can rapidly identify sparse subspaces relevant to modeling the unknown objective function, enabling sample-efficient high-dimensional BO. In an extensive suite of experiments comparing to existing methods for high-dimensional BO we demonstrate that our algorithm, Sparse Axis-Aligned Subspace BO (SAASBO), achieves excellent performance on several synthetic and real-world problems without the need to set problem-specific hyperparameters.

1   INTRODUCTION

Optimization plays an essential role in many fields of science, engineering, and beyond. From calibrating complex experimental systems to tuning hyperparameters of machine learning models, the need for scalable and efficient optimization methods is ubiquitous. Bayesian Optimization (BO) algorithms have proven particularly successful on a wide variety of domains including hyperparameter tuning [Snoek et al., 2012], A/B tests [Letham et al., 2019], chemical engineering [Hernández-Lobato et al., 2017], materials science [Ueno et al., 2016], control systems [Candelieri et al., 2018], and drug discovery [Negoescu et al., 2011].

These algorithms typically consist of two components. The first component employs Bayesian methods to construct a surrogate model of the (unknown) objective function. The second component uses this model together with an acquisition function to select the most promising query point(s) at which to evaluate the objective function. By leveraging the uncertainty quantification provided by the Bayesian model, a well-designed BO algorithm can provide an effective balance between exploration and exploitation, leading to highly sample-efficient optimization.

While BO has become a workhorse algorithm that is employed in a wide variety of settings, successful applications are often limited to low-dimensional problems, e.g. fewer than twenty dimensions [Frazier, 2018]. Applying BO to high-dimensional problems remains a significant challenge. The difficulty can be traced to both of the algorithm components mentioned above, although we postulate that suitable function priors are especially important for good performance. In particular, in order for BO to be sample-efficient in high-dimensional spaces, it is crucial to define surrogate models that are sufficiently parsimonious that they can be inferred from a small number of query points. An overly flexible class of models is likely to suffer from overfitting, which severely limits its effectiveness in decision-making. Likewise, an overly rigid class of models is unlikely to capture enough features of the objective function. A compromise between flexibility and parsimony is essential.

In this work we focus on the setting where we aim to optimize a black-box function with hundreds of variables and where we are limited to a few hundred queries of the objective function. We argue that in this low-sample regime Gaussian process surrogate models defined on sparse axis-aligned subspaces provide an attractive compromise between flexibility and parsimony. More specifically, our contributions are as follows:

• We propose the sparsity-inducing SAAS function prior.
• We demonstrate that when combined with the No-U-Turn Sampler (NUTS) for inference, our surrogate model quickly identifies the most relevant low-dimensional subspace, which in turn leads to sample-efficient BO.
• We show that SAASBO outperforms a number of strong baselines on several problems, including three real-world problems with as many as 388 dimensions, all without setting problem-specific hyperparameters.
2   RELATED WORK

There is a large body of research on high-dimensional BO, and a wide variety of surrogate modeling and acquisition strategies have been proposed. In the following we draw attention to a number of common themes.

A popular approach is to rely on low-dimensional structure, with several methods utilizing random projections [Wang et al., 2016, Qian et al., 2016, Binois et al., 2020, Letham et al., 2020]. REMBO uses a random projection to project low-dimensional points up to the original space [Wang et al., 2016]. ALEBO introduces several refinements to REMBO and demonstrates improved performance across a large number of problems [Letham et al., 2020]. Alternatively, the embedding can be learned jointly with the model, including both linear [Garnett et al., 2014] and non-linear [Lu et al., 2018] embeddings. Finally, Hashing-enhanced Subspace BO (HeSBO) [Nayebi et al., 2019] relies on hashing and sketching to reduce surrogate modeling and acquisition function optimization to a low-dimensional space.

Several methods rely on additive structure, where the function is assumed to be a sum of low-dimensional components [Kandasamy et al., 2015, Gardner et al., 2017, Mutny and Krause, 2018, Wang et al., 2018]. This approach allows separating the input space into independent domains, reducing the effective dimensionality of the model.

A common feature of many BO algorithms in high dimensions is that they tend to prefer highly uncertain query points near the domain boundary. As this is usually where the model is the most uncertain, this is often a poor choice that leads to over-exploration and poor optimization performance. Oh et al. [2018] address this issue by introducing a cylindrical kernel that promotes selection of query points in the interior of the domain. LineBO [Kirschner et al., 2019] optimizes the acquisition function along one-dimensional lines, which also helps to avoid highly uncertain points. The TuRBO algorithm uses several trust regions centered around the current best solution [Eriksson et al., 2019]. These trust regions are resized based on progress, allowing TuRBO to zoom in on promising regions. Li et al. [2017] use dropout to select a subset of dimensions over which to optimize the acquisition function, with excluded dimensions fixed to the value of the best point found so far.

It is also important to note that there are many black-box optimization algorithms that do not rely on Bayesian methods, with evolutionary algorithms being especially common. While most methods require thousands of evaluations to find good minima [Yu and Gen, 2010], the popular covariance matrix adaptation evolution strategy (CMA-ES) [Hansen et al., 2003] is competitive with BO on some problems [Letham et al., 2020].

3   BACKGROUND

We use this section to establish our notation and review necessary background material. Throughout this paper we work in the D-dimensional domain D = [0, 1]^D. We consider the minimization problem x_min ∈ argmin_{x∈D} f_obj(x) for a noise-free objective function f_obj : D → R. We assume that evaluations of f_obj are costly and that we are limited to at most a few hundred. Additionally, f_obj is a black-box function and gradient information is unavailable.

The rest of this section is organized as follows: in Sec. 3.1 we review Gaussian processes, and in Sec. 3.2 we review the expected improvement acquisition function.

3.1   GAUSSIAN PROCESSES

Gaussian processes (GPs) offer powerful non-parametric function priors that are the gold standard in BO due to their flexibility and excellent uncertainty quantification. A GP on the input space D is specified by a covariance function or kernel k : D × D → R [Rasmussen, 2003]. (Here and elsewhere we assume that the mean function is uniformly zero.) A common choice is the RBF or squared exponential kernel, which is given by

    k_ψ(x, y) = σ_k² exp{−½ Σ_i ρ_i (x_i − y_i)²}    (1)

where ρ_i for i = 1, ..., D are inverse squared length scales and where we use ψ to collectively denote all the hyperparameters, i.e. ψ = {ρ_{1:D}, σ_k²}. For scalar regression f : D → R the joint density of a GP takes the form

    p(y, f | X) = N(y | f, σ² 1_N) N(f | 0, K_XX^ψ)    (2)

where y are the real-valued targets, f are the latent function values, X = {x_i}_{i=1}^N are the N inputs with x_i ∈ D, σ² is the variance of the Normal likelihood N(y | ·), and K_XX^ψ is the N × N kernel matrix. Throughout this paper we will be interested in modeling noise-free functions, in which case σ² is set to a small constant. The marginal likelihood of the observed data can be computed in closed form:

    p(y | X, ψ) = ∫ df p(y, f | X) = N(y | 0, K_XX^ψ + σ² 1_N).    (3)

The posterior distribution of the GP at a query point x* ∈ D is the Normal distribution N(μ_f(x*), σ_f(x*)²), where μ_f(·) and σ_f(·)² are given by

    μ_f(x*) = (k_{*X}^ψ)^T (K_XX^ψ + σ² 1_N)^{−1} y    (4)
    σ_f(x*)² = k_{**}^ψ − (k_{*X}^ψ)^T (K_XX^ψ + σ² 1_N)^{−1} k_{*X}^ψ    (5)

Here k_{**}^ψ = k_ψ(x*, x*) and k_{*X}^ψ is the column vector specified by (k_{*X}^ψ)_n = k_ψ(x*, x_n) for n = 1, ..., N.
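As a concrete illustration of Eqns. (1), (4), and (5), the following minimal NumPy sketch (our own, not code from the paper; the function names rbf_kernel and gp_posterior are ours) computes the RBF kernel and the GP posterior mean and variance at a single query point in the noise-free setting described above.

```python
import numpy as np

def rbf_kernel(X, Y, rho, sigma_k2):
    """RBF kernel of Eqn. (1) with inverse squared length scales rho."""
    diff = X[:, None, :] - Y[None, :, :]           # shape (N, M, D)
    sq = np.sum(rho * diff**2, axis=-1)            # weighted squared distances, (N, M)
    return sigma_k2 * np.exp(-0.5 * sq)

def gp_posterior(X, y, x_star, rho, sigma_k2, sigma2=1e-6):
    """Posterior mean and variance of Eqns. (4)-(5) at a single query point x_star."""
    K = rbf_kernel(X, X, rho, sigma_k2) + sigma2 * np.eye(len(X))
    k_star = rbf_kernel(x_star[None, :], X, rho, sigma_k2)[0]   # (N,)
    mu = k_star @ np.linalg.solve(K, y)
    var = sigma_k2 - k_star @ np.linalg.solve(K, k_star)        # k_** = sigma_k2 for RBF
    return mu, var
```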
3.2   EXPECTED IMPROVEMENT

Expected improvement (EI) is a popular acquisition function that is defined as follows [Mockus et al., 1978, Jones et al., 1998]. Suppose that in previous rounds of BO we have collected H = {x_{1:N}, y_{1:N}}. Then let y_min = min_n y_n denote the best function evaluation we have seen so far. We define the improvement u(x | y_min) at query point x ∈ D as u(x | y_min) = max(0, y_min − f(x)). EI is defined as the expectation of the improvement over the posterior of f(x):

    EI(x | y_min, ψ) = E_{p(f(x) | ψ, H)}[u(x | y_min)]    (6)

where our notation makes explicit the dependence of Eqn. (6) on the kernel hyperparameters ψ. For a GP like in Sec. 3.1 this expectation can be evaluated in closed form:

    EI(x | y_min, ψ) = (y_min − μ_f(x)) Φ(Z) + σ_f(x) φ(Z)    (7)

where Z ≡ (y_min − μ_f(x)) / σ_f(x) and where Φ(·) and φ(·) are the CDF and PDF of the unit Normal distribution, respectively. By maximizing Eqn. (7) over D we can find query points x that balance exploration and exploitation.
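For reference, the closed-form EI of Eqn. (7) can be computed in a few lines. The sketch below is our own illustration (the function name and its arguments are ours) and assumes the posterior mean and standard deviation at the query point have already been computed, e.g. via Eqns. (4)-(5).

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_min):
    """Closed-form EI of Eqn. (7) given posterior mean mu and std sigma at a query point."""
    sigma = np.maximum(sigma, 1e-12)     # guard against a zero predictive std
    z = (y_min - mu) / sigma
    return (y_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```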
4   BAYESIAN OPTIMIZATION WITH SPARSE AXIS-ALIGNED SUBSPACES

We now introduce the surrogate model we use for high-dimensional BO. For a large number of dimensions, the space of functions mapping D to R is—to put it mildly—very large, even assuming a certain degree of smoothness. To facilitate sample-efficient BO it is necessary to make additional assumptions. Intuitively, we would like to assume that the dimensions of x ∈ D exhibit a hierarchy of relevance. For example, in a particular problem we might have that {x_3, x_52} are crucial features for mapping the principal variation of f_obj, {x_7, x_14, x_31, x_72} are of moderate importance, while the remaining features are of marginal importance. This motivates the following desiderata for our function prior:

1. Assumes a hierarchy of feature relevances
2. Encompasses a flexible class of smooth non-linear functions
3. Admits tractable (approximate) inference

4.1   SAAS FUNCTION PRIOR

To satisfy our desiderata we introduce a GP model with a structured prior over the kernel hyperparameters, in particular one that induces sparse structure in the (inverse squared) length scales ρ_i. In detail we define the following model:

    [kernel variance]    σ_k² ∼ LN(0, 10²)    (8)
    [global shrinkage]   τ ∼ HC(α)
    [length scales]      ρ_i ∼ HC(τ)   for i = 1, ..., D
    [function values]    f ∼ N(0, K_XX^ψ)   with ψ = {ρ_{1:D}, σ_k²}
    [observations]       y ∼ N(f, σ² 1_N)

where LN denotes the log-Normal distribution and HC(α) denotes the half-Cauchy distribution, i.e. p(τ | α) ∝ (α² + τ²)^{−1} 1(τ > 0) and p(ρ_i | τ) ∝ (τ² + ρ_i²)^{−1} 1(ρ_i > 0). Here α > 0 is a hyperparameter that controls the level of shrinkage (our default is α = 0.1). We use an RBF kernel, although other choices like the Matérn-5/2 kernel are also possible. We also set σ² → 10^{−6}, since we focus on noise-free objective functions f_obj. Noisy objective functions can be accommodated by placing a weak prior on σ², for example σ² ∼ LN(0, 10²).

The SAAS function prior defined in (8) has the following important properties. First, the prior on the kernel variance σ_k² is weak (i.e. non-informative). Second, the level of global shrinkage (i.e. sparsity) is controlled by the scalar τ > 0, which tends to concentrate near zero due to the half-Cauchy prior. Third, the (inverse squared) length scales ρ_i are also governed by half-Cauchy priors, and thus they too tend to concentrate near zero (more precisely, for most i we expect ρ_i ≲ τ). Consequently most of the dimensions are 'turned off' in accord with the principle of automatic relevance determination introduced by MacKay and Neal [1994]. Finally, while the half-Cauchy priors favor values near zero, they have heavy tails. This means that if there is sufficient evidence in the observations y, the posterior over τ will be pushed to higher values, thus reducing the level of shrinkage and allowing more of the ρ_i to escape zero, effectively 'turning on' more dimensions. The parsimony inherent in our function prior is thus adaptive: as more data is accumulated, more of the ρ_i will escape zero, and posterior mass will give support to a richer class of functions. This is in contrast to a standard GP fit with maximum likelihood estimation (MLE), which will generally exhibit non-negligible ρ_i for most dimensions—since there is no mechanism regularizing the length scales—typically resulting in drastic overfitting in high-dimensional settings.

Conceptually, our function prior describes functions defined on sparse axis-aligned subspaces, thus the name of our prior (SAAS) and our method (SAASBO).
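To make the generative model in Eqn. (8) concrete, here is a minimal sketch of how a SAAS-style prior could be written in NumPyro and fit with NUTS. This is our own illustrative code rather than the authors' implementation: the name saas_gp and the choice of 256 warmup/posterior samples are ours, and for simplicity the latent function values f are marginalized out via the closed-form marginal likelihood of Eqn. (3).

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from jax import random
from numpyro.infer import MCMC, NUTS

def saas_gp(X, y, alpha=0.1, sigma2=1e-6):
    """Sketch of a SAAS-style GP model (cf. Eqn. (8)) with the GP marginalized out."""
    N, D = X.shape
    var_k = numpyro.sample("kernel_var", dist.LogNormal(0.0, 10.0))   # weak prior on sigma_k^2
    tau = numpyro.sample("tau", dist.HalfCauchy(alpha))               # global shrinkage
    rho = numpyro.sample("rho", dist.HalfCauchy(tau).expand([D]).to_event(1))
    # RBF kernel with inverse squared length scales rho (cf. Eqn. (1))
    diff = X[:, None, :] - X[None, :, :]
    K = var_k * jnp.exp(-0.5 * jnp.sum(rho * diff**2, axis=-1)) + sigma2 * jnp.eye(N)
    # marginal likelihood of Eqn. (3), zero mean function
    numpyro.sample("y", dist.MultivariateNormal(jnp.zeros(N), covariance_matrix=K), obs=y)

# example usage (X_train, y_train assumed to be given):
# mcmc = MCMC(NUTS(saas_gp), num_warmup=256, num_samples=256)
# mcmc.run(random.PRNGKey(0), X_train, y_train)
```

In the full method, the L posterior samples {ψ_ℓ} returned by NUTS are then used to average the EI acquisition function, as in Eqn. (11) below.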
Algorithm 1: We outline the main steps in SAASBO when NUTS is used for inference. To instead use MAP we simply swap out line 4. For details on inference see Sec. 4.2; for details on EI maximization see Sec. 4.3.
    Input: Objective function f_obj; initial evaluation budget m ≥ 2; total evaluation budget T > m; hyperparameter α; number of NUTS samples L; and initial query set x_{1:m} and evaluations y_{1:m} (optional)
    Output: Approximate minimizer and minimum (x_min, y_min)
1   If {x_{1:m}, y_{1:m}} is not provided, let x_{1:m} be a Sobol sequence in D and let y_t = f_obj(x_t) for t = 1, ..., m.
2   for t = m + 1, ..., T do
3       Let H_t = {x_{1:t−1}, y_{1:t−1}} and y_min,t = min_s y_s

We use multiple runs of L-BFGS-B to obtain the query point

    x_next = argmax_x EI(x | y_min, {ψ_ℓ})    (11)

See the supplementary materials for further details and Alg. 1 for a complete outline of the SAASBO algorithm.
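As a rough illustration of Eqn. (11), the sketch below (our own code with hypothetical helper names; it reuses gp_posterior and expected_improvement from the earlier snippets and assumes psi_samples is a list of posterior draws (ρ, σ_k²) from NUTS) averages EI over the hyperparameter samples {ψ_ℓ} and maximizes the average with L-BFGS-B from a handful of random restarts.

```python
import numpy as np
from scipy.optimize import minimize

def maximize_avg_ei(X, y, psi_samples, y_min, D, n_restarts=8, seed=0):
    """Maximize EI averaged over posterior hyperparameter samples (cf. Eqn. (11))."""
    rng = np.random.default_rng(seed)

    def neg_avg_ei(x):
        eis = []
        for rho, sigma_k2 in psi_samples:
            mu, var = gp_posterior(X, y, x, rho, sigma_k2)
            eis.append(expected_improvement(mu, np.sqrt(max(var, 0.0)), y_min))
        return -np.mean(eis)

    best_x, best_val = None, np.inf
    # restart L-BFGS-B from a few random points in the domain [0, 1]^D
    for x0 in rng.uniform(0.0, 1.0, size=(n_restarts, D)):
        res = minimize(neg_avg_ei, x0, method="L-BFGS-B", bounds=[(0.0, 1.0)] * D)
        if res.fun < best_val:
            best_x, best_val = res.x, res.fun
    return best_x
```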
4.4   DISCUSSION

We note that the axis-aligned structure of our model need not be as restrictive as one might at first assume. For example, suppose that f_obj can be written as f_obj(x) = g(x_3 − x_7) for some g : R → R. In order for our model to capture the structure of f_obj, both x_3 and x_7 need to be identified as relevant. In many cases we expect this to be possible with a relatively small number of samples. While it is true that identifying the direction z = x_3 − x_7 could be even easier in a different coordinate system, inferring non-axis-aligned subspaces would come at the cost of substantially increased computational cost. More importantly, by searching over a much larger set of subspaces our surrogate model would likely be much more susceptible to overfitting. Given that for many problems we expect much of the function variation to be captured by axis-aligned blocks of input features, we view our axis-aligned assumption as a good compromise between flexibility and parsimony. Importantly, our modeling approach does not sacrifice any of the many benefits of GPs (e.g. flexible non-linearity and non-parametric latent functions), nor do we need to make any strong assumptions about f_obj (e.g. additive decomposition).

It is important to emphasize that it is by design that the model defined in Sec. 4.1 does not include any discrete latent variables. A natural alternative to our model would introduce D binary-valued variables that control whether or not a given dimension is relevant to modeling f_obj. However, inference in any such model is very challenging, as it requires exploring a discrete space of size 2^D. Our model can be understood as a continuous relaxation of such an approach. We discuss this point in more detail in Sec. A.3 in the supplementary materials.

5   EXPERIMENTS

We present an empirical validation of our approach. In Sec. 5.1-5.2 we characterize the behavior of SAASBO in controlled settings. In Sec. 5.4-5.7 we benchmark SAASBO against a number of state-of-the-art methods for high-dimensional BO. SAASBO will be implemented in BoTorch and code for reproducing the experiments will be available at https://github.com/martinjankowiak/saasbo.

5.1   THE SAAS PRIOR PROVIDES GOOD MODEL FIT IN HIGH DIMENSIONS

In Fig. 1 we demonstrate the importance of using a sparsity-inducing prior like SAAS when fitting a GP in a high-dimensional domain. In D = 100 dimensions both maximum likelihood estimation and full Bayesian inference for a GP with weak log-Normal priors on the (inverse squared) length scales ρ_i concentrate on solutions in which the vast majority of the ρ_i are O(1). Consequently, with high probability the kernel similarity between a randomly chosen test point and any of the N = 100 training data points is O(exp(−D)) ≈ 0, with the result that both these models revert to a trivial mean prediction across most of the domain. By contrast, the SAAS prior only allows a few ρ_i to escape zero, resulting in a model that is much more useful for exploration and exploitation of the most important design variables.

Figure 1: We compare model fit for three models (GP-MLE, GP-NUTS-Dense, and GP-NUTS-SAAS; each panel plots predicted values against true values) using the same train/test data obtained from two independent runs of Algorithm 1 on the d = 6 Hartmann function embedded in D ∈ {30, 100} dimensions. We compare: (left) a GP fit with MLE; (middle) a GP with weak priors fit with NUTS; and (right) a GP with a SAAS prior (this paper; see Eqn. (8)) fit with NUTS. In D = 30 dimensions (top row) both models fit with NUTS provide good fits to the test data, while MLE provides a bad fit near the minimum. In D = 100 dimensions (bottom row) only SAAS provides a good fit. In each figure mean predictions are depicted with dots and bars denote 95% confidence intervals.
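A quick numerical check of the kernel-similarity argument in Sec. 5.1 (our own illustration; setting ρ_i = 1 for all i stands in for the O(1) inverse squared length scales that MLE and the dense model concentrate on): with D = 100, the RBF similarity between a random test point and every training point is vanishingly small, so the posterior mean reverts to the prior mean almost everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 100
X_train = rng.uniform(size=(100, D))   # N = 100 random training points in [0, 1]^100
x_test = rng.uniform(size=D)           # a random test point
rho = np.ones(D)                       # O(1) inverse squared length scales

# RBF similarities of Eqn. (1) with sigma_k^2 = 1
sims = np.exp(-0.5 * np.sum(rho * (X_train - x_test) ** 2, axis=1))
print(sims.max())                      # on the order of 1e-3 or smaller
```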
5.2   SAASBO CAN QUICKLY IDENTIFY THE MOST RELEVANT DIMENSIONS

We characterize the behavior of SAASBO in a controlled setting where we embed the two-dimensional Branin function in D = 100 dimensions. First, we explore the degree to which SAASBO's performance depends on the approximate inference algorithm used, in particular comparing NUTS
to MAP (see Sec. 4.2 for details on inference). In Fig. 2 (left) we see that NUTS outperforms MAP by a considerable margin. In Fig. 2 (middle and right) we demonstrate that both inference methods are able to reliably identify the two relevant dimensions after ∼20-30 evaluations.

Figure 2: We explore how SAASBO performs on Branin (D = 100), comparing SAASBO-NUTS for three values of the sparsity-controlling hyperparameter α (α ∈ {0.01, 0.1, 1.0}) to SAASBO-MAP. Each curve corresponds to 60 independent replications of Algorithm 1. Left: We compare performance w.r.t. the best minimum found (the mean is depicted by a thick line and shaded bands denote standard errors). Middle: We depict the mean number of relevant dimensions found, where a relevant dimension is declared 'found' if its corresponding PosteriorMedian(ρ_k) is among the two largest {PosteriorMedian(ρ_i)}_{i=1}^D. Right: We depict the mean effective subspace dimension, defined to be the number of dimensions for which PosteriorMedian(ρ_k) > 0.5.

Why does NUTS outperform MAP even though MAP is able to identify the relevant subspace? We hypothesize that the primary reason for the superior performance of NUTS is that the EI objective in Eqn. (10) is considerably more robust when averaged over multiple samples of the GP kernel hyperparameters. In particular, averaging over multiple samples—potentially from distinct modes of the posterior—appears to mitigate EI's tendency to seek out the boundary of the domain D. For this reason we use NUTS for the experiments in this work, noting that while we obtain good performance with MAP in some problem settings, we find that NUTS is significantly more robust.

Next, we explore the dependence of SAASBO-NUTS on the hyperparameter α. In Fig. 2 (left) we see that there is minimal dependence on α, with the three values leading to similar optimization performance. In Fig. 2 (middle and right) we see that, as expected, smaller values of α are more conservative (i.e., prefer smaller subspaces), while larger values of α are less conservative (i.e., prefer larger subspaces). We note, however, that this effect is most pronounced when only a small number of datapoints have been collected. After ∼20 function evaluations the observations overwhelm the prior p(τ) and the posterior quickly concentrates on the two relevant dimensions.

Given the good performance of all three values of α, for the remainder of our experiments we choose the intermediate value α = 0.1. While performance can perhaps be improved in some cases by tuning α, we find it encouraging that we can get good performance with a single α. We emphasize that α is the only hyperparameter that governs the function prior, and that all remaining hyperparameters control the computational budget (e.g. the number of NUTS samples L). This is in contrast to the many methods for high-dimensional BO that rely on several (potentially sensitive) hyperparameters such as the dimension d_e of a random embedding.
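As an illustration of the dimension-relevance diagnostics used in Fig. 2, the per-dimension posterior medians of ρ can be computed and thresholded as follows (our own sketch; rho_samples is assumed to be an array of NUTS draws of ρ with shape (num_samples, D), e.g. mcmc.get_samples()["rho"] from the earlier NumPyro sketch).

```python
import numpy as np

def effective_subspace(rho_samples, threshold=0.5):
    """Rank dimensions by PosteriorMedian(rho_i) and report those above a threshold."""
    medians = np.median(rho_samples, axis=0)   # shape (D,)
    ranked = np.argsort(-medians)              # most relevant dimensions first
    active = np.where(medians > threshold)[0]  # 'effective subspace' as in Fig. 2 (right)
    return ranked, active
```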
                                                                                                                                               BO method that does not rely on GPs, namely SMAC [Hut-
of α are less conservative (i.e., prefer larger subspaces). We
                                                                                                                                               ter et al., 2011]. Finally, we also compare to scrambled
note, however, that this effect is most pronounced when
                                                                                                                                               Sobol sequences [Owen, 2003].
only a small number of datapoints have been collected. Af-
ter ∼ 20 function evaluations the observations overwhelm                                                                                       We use the default settings for all baselines. For ALEBO and
the prior p(τ) and the posterior quickly concentrates on the                                                                                   HeSBO we evaluate both de = 5 and de = 10 on the three
two relevant dimensions.                                                                                                                       synthetic problems in Sec. 5.4. As de = 5 does not perform
                                                                                                                                               well on the three real-world applications in Sec. 5.5-5.7, we
Given the good performance of all three values of α, for the
                                                                                                                                               instead evaluate de = 10 and de = 20 on these problems.
remainder of our experiments we choose the intermediate
value α = 0.1. While performance can perhaps be improved                                                                                       We also mention a baseline method for which we do not re-
in some cases by tuning α, we find it encouraging that we                                                                                      port results, since it underperforms random search. Namely
can get good performance with a single α. We emphasize                                                                                         for our surrogate model we use a quadratic polynomial over
that α is the only hyperparameter that governs the function                                                                                    D with O(D2 ) coefficients governed by a sparsity-inducing
Horseshoe prior [Carvalho et al., 2009]. As in Baptista and Poloczek [2018], Oh et al. [2019], this finite feature expansion admits efficient inference with a Gibbs sampler. Unfortunately, in our setting, where D is continuous and not discrete, this leads to pathological behavior when combined with EI, since the minima of simple parametric models are very likely to be found at the boundary of D. This is in contrast to the mean-reverting behavior of a GP with an RBF or Matérn kernel, which is a much more appropriate modeling assumption in high dimensions.

5.4   SYNTHETIC PROBLEMS

In this section we consider the Branin (d = 2), Hartmann (d = 6), and Rosenbrock (d = 3) test functions embedded in a D = 100 space. These are problems with unambiguous low-dimensional structure where we expect both random embedding methods and SAASBO to perform well.
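To make the synthetic setup concrete, here is a minimal sketch (our own illustration, not the paper's benchmark code; the choice of active coordinates is arbitrary) of how a low-dimensional test function such as Branin can be embedded in the D = 100 unit hypercube: only two designated coordinates affect the output and all remaining coordinates are inert.

```python
import numpy as np

def branin(u, v):
    """Standard two-dimensional Branin function."""
    return ((v - 5.1 / (4 * np.pi**2) * u**2 + 5 / np.pi * u - 6) ** 2
            + 10 * (1 - 1 / (8 * np.pi)) * np.cos(u) + 10)

def embedded_branin(x, active_dims=(17, 64)):
    """Branin embedded in [0, 1]^100: rescale the two active coordinates to
    Branin's usual domain [-5, 10] x [0, 15]; all other coordinates are inert."""
    u = 15.0 * x[active_dims[0]] - 5.0
    v = 15.0 * x[active_dims[1]]
    return branin(u, v)

# example: evaluate at a random point in the D = 100 unit hypercube
x = np.random.default_rng(0).uniform(size=100)
value = embedded_branin(x)
```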

Fig. 3 shows that SAASBO and ALEBO-5 perform the best on Branin. SAASBO performs the best on Hartmann, followed by ALEBO-10. HeSBO performs well on Rosenbrock, and the final performance of SAASBO, HeSBO-5, HeSBO-10, and ALEBO-5 is similar. However, both ALEBO and HeSBO show significant sensitivity to the embedded subspace dimension on at least two of the three problems, highlighting a serious downside of random embedding methods. Crucially, this important hyperparameter needs to be chosen before the start of optimization and is not learned.

Figure 3: We compare SAASBO to seven baseline methods (TuRBO, SMAC, EBO, CMA-ES, Sobol, and ALEBO and HeSBO with d_e ∈ {5, 10}) on three d-dimensional functions (Branin, Hartmann, Rosenbrock) embedded in D = 100 dimensions. In each case we do 30 independent replications. Top row: For each method we depict the mean value of the best minimum found at a given iteration. Bottom row: For each method we depict the distribution over the final approximate minimum y_min encoded as a violin plot, with horizontal bars corresponding to 5%, 50%, and 95% quantiles.

5.5   ROVER TRAJECTORY PLANNING

We consider a variation of the rover trajectory planning problem from [Wang et al., 2018] where the task is to find an optimal trajectory through a 2d-environment. In the original problem, the trajectory is determined by fitting a B-spline to 30 waypoints and the goal is to optimize the locations of these waypoints. This is a challenging problem that requires thousands of evaluations to find good solutions, see e.g. [Eriksson et al., 2019]. To make the problem more suitable for small evaluation budgets, we require that the B-spline starts and ends at the pre-determined starting position and destination. We also increase the dimensionality to D = 100 by using 50 waypoints. Fig. 4 shows that SAASBO performs the best on this problem. This problem is challenging for all methods, each of which had at least one replication where the final reward was below 2.5.
Rover (D = 100)                                            SVM (D = 388)                                        Vehicle Design (D = 124)
                         −2.00                                                       0.40                                                       340
       Negative Reward   −2.25                                                       0.35                                                       320
[Figure 4: six panels. Top row: best value found vs. number of evaluations for Rover (final value), SVM (test RMSE), and MOPTA (vehicle mass). Bottom row: violin plots of the final value for the same three problems. Methods: SAASBO, TuRBO, SMAC, EBO, ALEBO (de = 10, 20), HeSBO (de = 10, 20), CMA-ES, Sobol.]

Figure 4: We compare SAASBO to baseline methods on rover trajectory planning (D = 100), SVM hyperparameter tuning (D = 388), and MOPTA vehicle design (D = 124). We do 30 independent replications for Rover and SVM and 15 replications for MOPTA. Top row: For each method we depict the mean value of the best minimum found at a given iteration. Bottom row: For each method we depict the distribution over the final approximate minimum ymin encoded as a violin plot, with horizontal bars corresponding to 5%, 50%, and 95% quantiles.

5.6   HYPERPARAMETER TUNING OF AN SVM

We define a hyperparameter tuning problem using a kernel support vector machine (SVM) trained on a 385-dimensional regression dataset. This results in a D = 388 problem, with 3 regularization parameters and 385 kernel length scales. We expect this problem to have some amount of low-dimensional structure, as we expect the regularization parameters to be most relevant, with a number of length scales of secondary, but non-negligible, importance. This intuition is confirmed in Fig. 7 in the supplementary materials, which demonstrates that SAASBO quickly focuses on the regularization parameters, explaining the superior performance of SAASBO seen in Fig. 4. ALEBO makes little progress after iteration 30, indicating that there may not be any good solutions within the random embeddings. HeSBO and EBO do better than the other methods, but fail to match the final performance of SAASBO.

5.7   VEHICLE DESIGN

We consider the vehicle design problem MOPTA08, a challenging real-world high-dimensional BO problem [Jones, 2008]. The goal is to minimize the mass of a vehicle subject to 68 performance constraints. The D = 124 design variables describe materials, gauges, and vehicle shape. While some methods such as Scalable Constrained Bayesian Optimization (SCBO) [Eriksson and Poloczek, 2020] can handle this constrained problem given thousands of evaluations, we convert the hard constraints into a soft penalty to accommodate our baseline methods, yielding a scalar objective function. Fig. 4 shows that SAASBO outperforms other methods by a large margin. TuRBO and CMA-ES perform better than the remaining methods, which fail to identify good solutions. While this problem does not have obvious low-dimensional structure, our flexible SAAS prior still results in superior optimization performance. In Fig. 8 in the supplementary materials we see that this good performance can be traced to the adaptive parsimony of the SAAS prior, which identifies small (d ∼ 2) subspaces at the beginning of optimization and increasingly larger (d ∼ 10) subspaces towards the end.
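As an illustration of this scalarization, a minimal sketch is given below. The callable mopta08 and the penalty weight of 10 are illustrative assumptions rather than the exact interface and value used in the experiments; constraints are assumed to be feasible when c_i(x) <= 0.

```python
import numpy as np

def mopta08_soft_penalty(x, mopta08, penalty_weight=10.0):
    """Scalarize MOPTA08: vehicle mass plus a penalty on constraint violations.

    `mopta08` is a hypothetical callable returning (mass, constraints); the
    interface and penalty weight are assumptions, not the released code.
    """
    mass, constraints = mopta08(x)
    # total violation of the 68 performance constraints c_i(x) <= 0
    violation = np.clip(np.asarray(constraints), 0.0, None).sum()
    return mass + penalty_weight * violation  # scalar objective to be minimized
```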
6   DISCUSSION

Black-box optimization in hundreds of dimensions presents a number of challenges, many of which can be traced to the many degrees of freedom that characterize high-dimensional spaces. The majority of approaches to Bayesian optimization try to circumvent this potential hazard by reducing the effective dimensionality of the problem. For example random projection methods like ALEBO and HeSBO work directly in a low-dimensional space, while methods like TuRBO or LineBO constrain the domain over which the acquisition function is optimized. We take the view that it is much more natural to work directly in the full space and instead rely on a sparsity-inducing function prior to mitigate the curse of dimensionality.

As we have shown in a comprehensive set of experiments, SAASBO outperforms state-of-the-art BO methods on several synthetic and real-world problems. Our approach provides several distinct advantages: we highlight three. First, it preserves—and therefore can exploit—structure in the input domain, in contrast to methods like ALEBO or HeSBO which risk scrambling it. Second, it is adaptive and exhibits little sensitivity to its hyperparameters. Third, it can naturally accommodate both input and output constraints, in contrast to methods that rely on random projections, for which input constraints are particularly challenging.

While we have obtained strikingly good performance using a simple acquisition strategy, it is likely that making the most of our SAAS function prior will require a decision-theoretic framework that is better suited to high-dimensional settings. This is an interesting direction for future elaborations of SAASBO.

Acknowledgements

We thank Neeraj Pradhan and Du Phan for help with NumPyro and Maximilian Balandat for providing feedback on the paper.
References

Maximilian Balandat, Brian Karrer, Daniel R. Jiang, Samuel Daulton, Benjamin Letham, Andrew Gordon Wilson, and Eytan Bakshy. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. In Advances in Neural Information Processing Systems 33, 2020.

Ricardo Baptista and Matthias Poloczek. Bayesian optimization of combinatorial structures. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 471–480. PMLR, 2018.

Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman. Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973–978, 2019.

Mickaël Binois, David Ginsbourger, and Olivier Roustant. On the choice of the low-dimensional domain for global optimization via random embeddings. Journal of Global Optimization, 76(1):69–90, 2020.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: Composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Antonio Candelieri, Raffaele Perego, and Francesco Archetti. Bayesian optimization of pump operations in water distribution systems. Journal of Global Optimization, 71(1):213–235, 2018.

Carlos M Carvalho, Nicholas G Polson, and James G Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80. PMLR, 2009.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

David Eriksson and Matthias Poloczek. Scalable constrained Bayesian optimization. arXiv preprint arXiv:2002.08526, 2020.

David Eriksson, Michael Pearce, Jacob R. Gardner, Ryan Turner, and Matthias Poloczek. Scalable global optimization via local Bayesian optimization. In Advances in Neural Information Processing Systems 32, pages 5497–5508, 2019.

Peter I Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

Jacob R. Gardner, Chuan Guo, Kilian Q. Weinberger, Roman Garnett, and Roger B. Grosse. Discovering and exploiting additive structure for Bayesian optimization. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1311–1319. PMLR, 2017.

Roman Garnett, Michael A. Osborne, and Philipp Hennig. Active learning of linear embeddings for Gaussian processes. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 230–239. AUAI Press, 2014.

Nikolaus Hansen, Sibylle D Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18, 2003.

José Miguel Hernández-Lobato, James Requeima, Edward O. Pyzer-Knapp, and Alán Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1470–1479. PMLR, 2017.

Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.

Donald R Jones. Large-scale multi-disciplinary mass optimization in the auto industry. In MOPTA 2008 Conference (20 August 2008), 2008.

Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

Kirthevasan Kandasamy, Jeff G. Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, pages 295–304. JMLR.org, 2015.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.

Johannes Kirschner, Mojmir Mutny, Nicole Hiller, Rasmus Ischebeck, and Andreas Krause. Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3429–3438. PMLR, 2019.

Benjamin Letham, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. Constrained Bayesian optimization with noisy experiments. Bayesian Analysis, 14(2):495–519, 2019.

Benjamin Letham, Roberto Calandra, Akshara Rai, and Eytan Bakshy. Re-examining linear embeddings for high-dimensional Bayesian optimization. In Advances in Neural Information Processing Systems 33, 2020.

Cheng Li, Sunil Gupta, Santu Rana, Vu Nguyen, Svetha Venkatesh, and Alistair Shilton. High dimensional Bayesian optimization using dropout. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 2096–2102. ijcai.org, 2017.

Xiaoyu Lu, Javier Gonzalez, Zhenwen Dai, and Neil D. Lawrence. Structured variationally auto-encoded optimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3273–3281. PMLR, 2018.

David JC MacKay and Radford M Neal. Automatic relevance determination for neural networks. Technical Report in preparation, Cambridge University, 1994.

Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. Toward global optimization, volume 2, chapter Bayesian methods for seeking the extremum. 1978.

Mojmir Mutny and Andreas Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems 31, pages 9019–9030, 2018.

Amin Nayebi, Alexander Munteanu, and Matthias Poloczek. A framework for Bayesian optimization in embedded subspaces. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4752–4761. PMLR, 2019.

Diana M Negoescu, Peter I Frazier, and Warren B Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346–363, 2011.

ChangYong Oh, Efstratios Gavves, and Max Welling. BOCK: Bayesian optimization with cylindrical kernels. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3865–3874. PMLR, 2018.

ChangYong Oh, Jakub M. Tomczak, Efstratios Gavves, and Max Welling. Combinatorial Bayesian optimization using the graph Cartesian product. In Advances in Neural Information Processing Systems 32, pages 2910–2920, 2019.

Art B Owen. Quasi-Monte Carlo sampling. Monte Carlo Ray Tracing: Siggraph, 1:69–88, 2003.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830, 2011.

Du Phan, Neeraj Pradhan, and Martin Jankowiak. Composable effects for flexible and accelerated probabilistic programming in NumPyro. arXiv preprint arXiv:1912.11554, 2019.

Hong Qian, Yi-Qi Hu, and Yang Yu. Derivative-free optimization of high-dimensional non-convex functions by sequential random embeddings. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1946–1952. IJCAI/AAAI Press, 2016.

Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960–2968, 2012.

Tsuyoshi Ueno, Trevor David Rhone, Zhufeng Hou, Teruyasu Mizoguchi, and Koji Tsuda. COMBO: An efficient Bayesian optimization library for materials science. Materials Discovery, 4:18–21, 2016.

Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 745–754. PMLR, 2018.

Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016.

Xinjie Yu and Mitsuo Gen. Introduction to evolutionary algorithms. Springer Science & Business Media, 2010.

Ciyou Zhu, Richard H Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.
A   INFERENCE

A.1   NUTS

We use the NUTS sampler implemented in NumPyro [Phan et al., 2019, Bingham et al., 2019], which leverages JAX for efficient hardware acceleration [Bradbury et al., 2020]. In most of our experiments (see Sec. E for exceptions) we run NUTS for 768 = 512 + 256 steps, where the first Nwarmup = 512 samples are used for burn-in and (diagonal) mass matrix adaptation (and thus discarded), and where we retain every 16th sample among the final Npost = 256 samples (i.e. sample thinning), yielding a total of L = 16 approximate posterior samples. It is these L samples that are then used to compute Eqns. (4), (5), and (10). We also limit the maximum tree depth in NUTS to 6.

We note that these choices are somewhat conservative, and in many settings we would expect good results with fewer samples. Indeed on the Branin test function, see Fig. 5, we find a relatively marginal drop in performance when we reduce the NUTS sampling budget as follows: i) reduce the number of warmup samples from 512 to 128; ii) reduce the number of post-warmup samples from 256 to 128; and iii) reduce the total number of retained samples from 16 to 8. We expect broadly similar results for many other problems. See Sec. C for corresponding runtime results.

[Figure 5: best value found vs. number of evaluations on Branin (d = 2, D = 100) for the 512-256-16 and 128-128-8 sampling budgets, with the global minimum indicated.]

Figure 5: We depict how SAASBO-NUTS performs on Branin as we reduce the sampling budget (Nwarmup, Npost, L) = (512, 256, 16) to (Nwarmup, Npost, L) = (128, 128, 8). We compare performance w.r.t. the best minimum found (the mean is depicted by a thick line and shaded bands denote standard errors). Each curve corresponds to 60 independent replications of Algorithm 1.
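For concreteness, a minimal NumPyro sketch of the sampler configuration described above is given below. The model function saas_gp_model is a placeholder for the SAAS GP of Sec. 4.1 and is not reproduced here; only the sampler settings correspond to the text.

```python
import jax.random as random
from numpyro.infer import MCMC, NUTS

def run_nuts(saas_gp_model, X, y, seed=0):
    # saas_gp_model is a hypothetical NumPyro model with signature (X, y)
    kernel = NUTS(saas_gp_model, max_tree_depth=6)  # diagonal mass matrix is the NUTS default
    mcmc = MCMC(
        kernel,
        num_warmup=512,    # N_warmup: burn-in and mass matrix adaptation (discarded)
        num_samples=256,   # N_post: post-warmup samples
        thinning=16,       # retain every 16th post-warmup sample -> L = 16 samples
        progress_bar=False,
    )
    mcmc.run(random.PRNGKey(seed), X, y)
    return mcmc.get_samples()  # dict of retained hyperparameter samples
```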
It is worth emphasizing that while SAASBO requires specifying a few hyperparameters that control NUTS, these hyperparameters are purely computational in nature, i.e. they have no effect on the SAAS function prior. Users simply choose a value of L that meets their computational budget. This is in contrast to e.g. the embedding dimension de that is required by ALEBO and HeSBO: the value of de often has significant effects on optimization performance.

We also note that it is possible to make SAASBO-NUTS faster by means of the following modifications:

  1. Warm-start mass adaptation with mass matrices from previous iterations.

  2. Instead of fitting a new SAAS GP at each iteration, only fit every M iterations (say M = 5), and reuse hyperparameter samples {ψℓ} across M iterations of SAASBO.

A.2   MAP

We run the Adam optimizer [Kingma and Ba, 2015] for 1500 steps with a learning rate of 0.02 and β1 = 0.50 to maximize the log density

    Us(ψs | τs) = log p(y | X, ψs) + log p(ψs | τs)    (12)

w.r.t. ψs for S = 4 pre-selected values of τs: τs ∈ {1, 10−1, 10−2, 10−3}. This optimization is trivially parallelized across S. For each s = 1, ..., S we then compute the leave-one-out predictive log likelihood using the mean and variance functions given in Eqns. (4)-(5). We then choose the value of s that maximizes this predictive log likelihood and use the corresponding kernel hyperparameters ψs to compute the expected improvement in Eqn. (10).

A.3   NO DISCRETE LATENT VARIABLES

As discussed briefly in the main text, it is important that the SAAS prior defined in Sec. 4.1 does not include any discrete latent variables. Indeed a natural alternative to our model would introduce D binary-valued latent variables that control whether or not a given dimension is relevant to modeling fobj. However, inference in any such model can be very challenging, as it requires exploring an extremely large discrete space of size 2D. Our model can be understood as a continuous relaxation of such an approach. This is a significant advantage since it means we can leverage gradient information to efficiently explore the posterior. Indeed, the structure of our sparsity-inducing prior closely mirrors the justly famous Horseshoe prior [Carvalho et al., 2009], which is a popular prior for sparse Bayesian linear regression. We note that in contrast to the linear regression setting of the Horseshoe prior, our sparsity-inducing prior governs inverse squared length scales in a non-linear kernel and not variances. While we expect that any prior that concentrates ρi at zero can exhibit good empirical performance in the setting of high-dimensional BO, this raises the important question of whether distributional assumptions other than those in Eqn. (8) may be better suited to governing our prior expectations about ρi. Making a careful investigation of this point is an interesting direction for future work.
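To make the MAP procedure of Sec. A.2 concrete, the following is a minimal sketch. The functions log_density (implementing Eqn. (12)) and loo_log_likelihood (the leave-one-out selection criterion) are placeholders, and mapping β1 = 0.50 to betas=(0.5, 0.999) is an assumption; this mirrors, but does not reproduce, the released implementation.

```python
import torch

def map_fit(log_density, loo_log_likelihood, init_psi,
            taus=(1.0, 1e-1, 1e-2, 1e-3)):
    """Fit psi by MAP for each pre-selected tau and keep the best candidate."""
    candidates = []
    for tau in taus:                          # S = 4 pre-selected values of tau
        psi = {k: v.detach().clone().requires_grad_(True)
               for k, v in init_psi.items()}
        optimizer = torch.optim.Adam(psi.values(), lr=0.02, betas=(0.5, 0.999))
        for _ in range(1500):                 # 1500 Adam steps, as in the text
            optimizer.zero_grad()
            loss = -log_density(psi, tau)     # maximize U_s by minimizing -U_s
            loss.backward()
            optimizer.step()
        candidates.append(psi)
    # the S fits are independent and can be run in parallel;
    # the winner is selected by its leave-one-out predictive log likelihood
    return max(candidates, key=loo_log_likelihood)
```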
B   EXPECTED IMPROVEMENT MAXIMIZATION

We first form a scrambled Sobol sequence x1:Q (see e.g. [Owen, 2003]) of length Q = 5000 in the D-dimensional domain D. We then compute the expected improvement in Eqn. (10) in parallel for each point in the Sobol sequence. We then choose the top K = 3 points in x1:Q that yield the largest EIs. For each of these K approximate maximizers we run L-BFGS-B [Zhu et al., 1997], initialized with the approximate maximizer and using the implementation provided by Scipy (in particular fmin_l_bfgs_b), to obtain the final query point xnext, which (approximately) maximizes Eqn. (10). We limit fmin_l_bfgs_b to use a maximum of 100 function evaluations.
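A minimal sketch of this routine is given below. The function expected_improvement is a placeholder for Eqn. (10) evaluated under the approximate posterior, and for brevity the sketch uses finite-difference gradients inside L-BFGS-B; analytic gradients of EI can be supplied instead. Everything else follows the text: Q = 5000 scrambled Sobol candidates, the top K = 3 points refined with L-BFGS-B limited to 100 function evaluations.

```python
import numpy as np
import torch
from scipy.optimize import fmin_l_bfgs_b

def maximize_ei(expected_improvement, D, Q=5000, K=3, seed=0):
    sobol = torch.quasirandom.SobolEngine(dimension=D, scramble=True, seed=seed)
    X_cand = sobol.draw(Q).double().numpy()              # Q x D candidates in [0, 1]^D
    ei = np.asarray(expected_improvement(X_cand))        # EI evaluated in parallel
    top_k = X_cand[np.argsort(-ei)[:K]]                  # K points with largest EI

    def neg_ei(x):                                       # L-BFGS-B minimizes, so negate
        return -float(expected_improvement(x[None, :]))

    best_x, best_val = None, np.inf
    for x0 in top_k:
        x_opt, f_opt, _ = fmin_l_bfgs_b(
            neg_ei, x0, approx_grad=True, bounds=[(0.0, 1.0)] * D, maxfun=100
        )
        if f_opt < best_val:
            best_x, best_val = x_opt, f_opt
    return best_x                                        # next query point x_next
```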
C   RUNTIME EXPERIMENT

We measure the runtime of SAASBO as well as each baseline method on the Branin test problem. See Table 1 for the results. We record runtimes for both the default SAASBO-NUTS settings described in Sec. A.1 as well as one with a reduced NUTS sampling budget. While SAASBO requires more time per iteration than other methods such as TuRBO and HeSBO, the overhead is relatively moderate in the setting where the black-box function fobj is very expensive to evaluate. We note that after reducing the NUTS sampling budget to (Nwarmup, Npost, L) = (128, 128, 8) about 75% of the runtime is devoted to EI optimization. Since our current implementation executes K = 3 runs of L-BFGS-B serially, this runtime could be reduced further by executing L-BFGS-B in parallel.

Table 1: Average runtime per iteration on the Branin test function embedded in a 100-dimensional space. Each method uses m = 10 initial points and a total of 50 function evaluations. Runtimes are obtained using a 2.4 GHz 8-Core Intel Core i9 CPU outfitted with 32 GB of RAM.

    Method                 Time / iteration
    SAASBO (default)        26.51 seconds
    SAASBO (128-128-8)      19.21 seconds
    TuRBO                    1.52 seconds
    SMAC                    12.12 seconds
    EBO                    128.10 seconds
    ALEBO (de = 5)           4.34 seconds
    ALEBO (de = 10)         11.91 seconds
    HeSBO (de = 5)           0.70 seconds
    HeSBO (de = 10)          1.51 seconds
    CMA-ES                  < 0.1 seconds
    Sobol                  < 0.01 seconds

D   ADDITIONAL FIGURES AND EXPERIMENTS

D.1   MODEL FITTING

In Fig. 6 we reproduce the experiment described in Sec. 5.1, with the difference that we replace the RBF kernel with a Matérn-5/2 kernel.

[Figure 6: six panels of predicted value vs. true value for GP-MLE, GP-NUTS-Dense, and GP-NUTS-SAAS, with d = 6 and D ∈ {30, 100}.]

Figure 6: This figure is an exact reproduction of Fig. 1 in the main text apart from the use of a Matérn-5/2 kernel instead of a RBF kernel. We compare model fit for three models using the same train/test data obtained from two independent runs of Algorithm 1 on the d = 6 Hartmann function embedded in D ∈ {30, 100} dimensions. We compare: (left) a GP fit with MLE; (middle) a GP with weak priors fit with NUTS; and (right) a GP with a SAAS prior (this paper; see Eqn. (8)) fit with NUTS. In D = 30 dimensions (top row) all models provide good fits to the test data. In D = 100 dimensions (bottom row) only SAAS provides a good fit. In each figure mean predictions are depicted with dots and bars denote 95% confidence intervals.

We note that the qualitative behavior in Fig. 6 matches the behavior in Fig. 1. In particular, in D = 100 dimensions only the sparsity-inducing SAAS function prior provides a good fit. This emphasizes that the potential for drastic overfitting that arises when fitting a non-sparse GP in high dimensions is fundamental and is not ameliorated by using a different kernel. In particular, the fact that the Matérn-5/2 kernel decays less rapidly at large distances as compared to the RBF kernel (quadratically instead of exponentially) does not prevent the non-sparse models from yielding essentially trivial predictions across most of the domain D.

D.2   SVM RELEVANCE PLOTS
In Fig. 7 we explore the relevant subspace identified by SAASBO during the course of optimization of the SVM problem discussed in Sec. 5.6. We see that the three most important hyperparameters, namely the regularization hyperparameters, are consistently found more or less immediately once the initial Sobol phase of Algorithm 1 is over. This explains the rapid early progress that SAASBO makes in Fig. 4 during optimization. We note that the 4th most relevant dimension turns out to be a length scale for a patient ID feature, which makes sense given the importance of this feature to the regression problem.

[Figure 7: two panels for SVM (D = 388): the number of relevant dimensions found and the effective subspace dimension (cutoffs ρ > 0.1 and ρ > 0.5) vs. the number of evaluations.]

Figure 7: Left: We depict the mean number of regularization hyperparameters that have been 'found' in the SVM problem, where a regularization hyperparameter is 'found' if its corresponding PosteriorMedian(ρk) is among the three largest of {PosteriorMedian(ρi)}i=1,...,D. Note that there are three regularization hyperparameters in total. Right: We depict the mean effective subspace dimension, defined to be the number of dimensions for which PosteriorMedian(ρk) > ξ, where ξ ∈ {0.1, 0.5} is an arbitrary cutoff. Means are averages across 30 independent replications.

D.3   MOPTA08 RELEVANCE PLOTS

In Fig. 8 we see that during the course of a single run of SAASBO on the MOPTA08 vehicle design problem, the effective dimension of the identified subspace steadily increases from about 2 to about 10 as more evaluations are collected. Using an increasingly flexible surrogate model over the course of optimization is key to the excellent optimization performance of SAASBO on this problem.

[Figure 8: the effective subspace dimension (cutoff ρ > 0.1) vs. the number of evaluations for MOPTA (D = 124).]

Figure 8: We depict the effective subspace dimension during the course of a single run of Algorithm 1 on the MOPTA vehicle design problem. Here the effective subspace dimension is the number of dimensions for which PosteriorMedian(ρk) > ξ, with ξ = 0.1 an arbitrary cutoff.
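The summary statistic used in Fig. 7 and Fig. 8 is straightforward to compute from posterior samples of ρ. A minimal sketch is given below, assuming an (L, D) array rho_samples of posterior samples; the array name is an illustrative assumption.

```python
import numpy as np

def effective_subspace_dimension(rho_samples, xi=0.1):
    """Number of dimensions whose posterior median of rho exceeds the cutoff xi."""
    rho_median = np.median(rho_samples, axis=0)   # posterior median of rho_k, shape (D,)
    return int(np.sum(rho_median > xi))

def top_k_dimensions(rho_samples, k=3):
    """Indices of the k largest posterior medians, e.g. to check whether the
    SVM regularization hyperparameters have been 'found'."""
    rho_median = np.median(rho_samples, axis=0)
    return np.argsort(-rho_median)[:k]
```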

D.4   SVM ABLATION STUDY

In Fig. 9 we depict results from an ablation study of SAASBO in the context of the SVM problem. First, as a companion to Fig. 1 and Fig. 6, we compare the BO performance of the SAAS function prior to a non-sparse function prior that places weak priors on the length scales. As we would expect from Fig. 1 and Fig. 6, the resulting BO performance is very poor for the non-sparse prior. Second, we also compare the default RBF kernel to a Matérn-5/2 kernel. We find that, at least on this problem, both kernels lead to similar BO performance.

[Figure 9: two panels for SVM (D = 388): test RMSE vs. the number of evaluations and final-value violin plots for SAASBO (RBF), SAASBO (Matérn-5/2), BO-NUTS-Dense (RBF), BO-NUTS-Dense (Matérn-5/2), and Sobol.]

Figure 9: We compare the BO performance of the SAAS function prior to a non-sparse function prior on the SVM hyperparameter tuning problem (D = 388). In addition we compare the RBF kernel to the Matérn-5/2 kernel. We do 15 independent replications for each method, except for SAASBO-RBF and Sobol, for which we reproduce the same 30 replications from the main text. Left: For each method we depict the mean value of the best minimum found at a given iteration. Right: For each method we depict the distribution over the final approximate minimum ymin encoded as a violin plot, with horizontal bars corresponding to 5%, 50%, and 95% quantiles.

D.5   ROTATED HARTMANN

In this experiment we study how the axis-aligned assumption in SAAS affects performance if we rotate the coordinate system. In particular, we consider the Hartmann function fhart for d = 6 embedded in D = 100. Given a linear projection dimensionality dp ≥ d, we generate a random projection Pdp ∈ Rdp×d where [Pdp]ij ∼ N(0, 1/dp). The goal is to optimize f̃(x) = fhart(Pdp⊤ x1:dp − z), where x ∈ [0, 1]D and z ∈ Rd. Given Pdp, z is a vector in [0, 1]d such that f̃([x∗; w]) = fhart(x∗) for all w ∈ [0, 1]D−d, where x∗ is the global optimum of the Hartmann function. The translation z guarantees that the global optimum value is attainable in the domain. We consider dp = 6, 18, 30 and generate a random Pdp and z for each embedded dimensionality that we use for all replications. EBO is excluded from this study as it performed worse than Sobol in Fig. 3.

The results are shown in Fig. 10. We see that SAASBO outperforms the other methods even though the function has been rotated, which violates the axis-aligned structure. Even though the function is rotated, SAASBO quickly identifies the most important parameters in the rotated space. We also notice that the worst-case performance of SAASBO is better than for the other methods across all projection dimensionalities considered.
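A sketch of this construction is given below. It assumes the projection is applied as Pdp⊤ x1:dp so that the argument passed to fhart is d-dimensional; the callable hartmann6 and the offset z are placeholders, and the construction of z (which depends on the Hartmann global optimum) is not shown.

```python
import numpy as np

def make_rotated_hartmann(hartmann6, z, d=6, d_p=18, D=100, seed=0):
    """Build the rotated/embedded Hartmann function of Sec. D.5 (a sketch).

    `hartmann6` stands in for the standard d = 6 Hartmann function and `z` is
    the offset described in the text; both are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, np.sqrt(1.0 / d_p), size=(d_p, d))   # [P]_ij ~ N(0, 1/d_p)

    def f_tilde(x):
        # x lies in [0, 1]^D; only the first d_p coordinates enter the objective
        return hartmann6(P.T @ np.asarray(x)[:d_p] - z)

    return f_tilde
```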
E   ADDITIONAL EXPERIMENTAL DETAILS

Apart from the experiment in Sec. 5.2 that is depicted in Fig. 2, we use α = 0.1 in all experiments. Apart from Fig. 6 and Fig. 9, we use an RBF kernel in all experiments.

E.1   MODEL FIT EXPERIMENT

In the model fit experiment in Sec. 5.1 we take data collected from two different runs of SAASBO in D = 100. We use one run as training data and the second run as test data, each with N = 100 datapoints. To construct datasets in D = 30 dimensions we include the 6 relevant dimensions as well as 24 randomly chosen redundant dimensions and drop all remaining dimensions.

E.2   INFERENCE AND HYPERPARAMETER COMPARISON EXPERIMENT

For the experiment in Sec. 5.2 that is depicted in Fig. 2 we initialize SAASBO with m = 10 points from a Sobol sequence.

E.3   BASELINES

We compare SAASBO to ALEBO, CMA-ES, EBO, HeSBO, SMAC, Sobol, and TuRBO. For ALEBO and HeSBO we use the implementations in BoTorch [Balandat et al., 2020] with the same settings that were used by [Letham et al., 2020]. We consider embeddings of dimensionality de = 5 and de = 10 on the synthetic problems, which is similar to the de = d and de = 2d heuristics that were considered in [Nayebi et al., 2019] as well as [Letham et al., 2020]. As the true active dimensionality d of fobj is unknown, we do not allow any method to explicitly use this additional information. For the three real-world experiments, de = 5 does not work well on any problem, so we instead report results for de = 10 and de = 20.

For CMA-ES we use the pycma implementation (https://github.com/CMA-ES/pycma). CMA-ES is initialized using a random point in the domain and uses the default initial step-size of 0.25. Recall that the domain is normalized to [0, 1]D for all problems. We run EBO using the reference implementation by the authors (https://github.com/zi-w/Ensemble-Bayesian-Optimization) with the default settings. EBO requires knowing the value of the function at the global optimum. Similarly to [Letham et al., 2020] we provide this value to EBO for all problems, but note that EBO still performs poorly on all problems apart from Branin and SVM.

Our comparison to SMAC uses SMAC4HPO, which is implemented in SMAC3 (https://github.com/automl/SMAC3). On all problems we run SMAC in deterministic mode, as all problems considered in this paper are noise-free. For Sobol we use the SobolEngine implementation in PyTorch. Finally, we compare to TuRBO with a single trust region due to the limited evaluation budget; we use the implementation provided by the authors (https://github.com/uber-research/TuRBO).

E.4   SYNTHETIC PROBLEMS

We consider three standard synthetic functions from the optimization literature. Branin is a 2-dimensional function that we embed in a 100-dimensional space. We consider the standard domain [−5, 10] × [0, 15] before normalizing the domain to [0, 1]100. For Hartmann, we consider the d = 6 version on the domain [0, 1]6 before embedding it in a 100-dimensional space. For Rosenbrock, we use d = 3 and the domain [−2, 2]3, which we then embed and normalize so that the full domain is [0, 1]100. Rosenbrock is a function that is challenging to model, as there are large function values at the boundary of the domain. For this reason all methods minimize log(1 + fobj(x)). All methods except for CMA-ES are initialized with m = 10 initial points for Branin and Rosenbrock and m = 20 initial points for Hartmann.

E.5   ROVER

We consider the rover trajectory optimization problem that was also considered in Wang et al. [2018]. The goal is to optimize the trajectory of a rover, where this trajectory is determined by fitting a B-spline to 30 waypoints in the 2D plane. While the original problem had a pre-determined origin and destination, the resulting B-spline was not constrained to start and end at these positions. To make the problem easier, we force the B-spline to start and end at these pre-determined positions. Additionally, we use 50 waypoints, which yields the D = 100 problem reported in Fig. 4.
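As an illustration of the embedding scheme described in Sec. E.4, the following sketch shows how the 2-dimensional Branin function can be embedded in [0, 1]^100; the rescaling constants follow the standard Branin domain stated above, and the function definition uses the usual Branin coefficients.

```python
import numpy as np

def branin(x1, x2):
    """Standard 2D Branin function on [-5, 10] x [0, 15]."""
    a, b, c = 1.0, 5.1 / (4 * np.pi**2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1**2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s

def embedded_branin(x):
    """Embedded objective: x lies in [0, 1]^100, only the first two dims matter."""
    x = np.asarray(x)
    x1 = -5.0 + 15.0 * x[0]   # map [0, 1] -> [-5, 10]
    x2 = 15.0 * x[1]          # map [0, 1] -> [0, 15]
    return branin(x1, x2)
```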