DEHB: Evolutionary Hyperband for Scalable, Robust and Efficient Hyperparameter Optimization∗
Noor Awad1, Neeratyoy Mallik1, Frank Hutter1,2
1 Department of Computer Science, University of Freiburg, Germany
2 Bosch Center for Artificial Intelligence, Renningen, Germany
{awad, mallik, fh}@cs.uni-freiburg.de

∗ Proceedings of IJCAI-21

arXiv:2105.09821v2 [cs.LG] 21 Oct 2021

Abstract

Modern machine learning algorithms crucially rely on several design decisions to achieve strong performance, making the problem of Hyperparameter Optimization (HPO) more important than ever. Here, we combine the advantages of the popular bandit-based HPO method Hyperband (HB) and the evolutionary search approach of Differential Evolution (DE) to yield a new HPO method which we call DEHB. Comprehensive results on a very broad range of HPO problems, as well as a wide range of tabular benchmarks from neural architecture search, demonstrate that DEHB achieves strong performance far more robustly than all previous HPO methods we are aware of, especially for high-dimensional problems with discrete input dimensions. For example, DEHB is up to 1000× faster than random search. It is also efficient in computational time, conceptually simple and easy to implement, positioning it well to become a new default HPO method.

1  Introduction

Many algorithms in artificial intelligence rely crucially on good settings of their hyperparameters to achieve strong performance. This is particularly true for deep learning [Henderson et al., 2018; Melis et al., 2018], where dozens of hyperparameters concerning both the neural architecture and the optimization & regularization pipeline need to be instantiated. At the same time, modern neural networks continue to get larger and more computationally expensive, making the need for efficient hyperparameter optimization (HPO) ever more important.

We believe that a practical, general HPO method must fulfill many desiderata, including: (1) strong anytime performance, (2) strong final performance with a large budget, (3) effective use of parallel resources, (4) scalability w.r.t. the dimensionality, and (5) robustness & flexibility. These desiderata drove the development of BOHB [Falkner et al., 2018], which satisfied them by combining the best features of Bayesian optimization via Tree Parzen estimates (TPE) [Bergstra et al., 2011] (in particular, strong final performance) and the many advantages of bandit-based HPO via Hyperband [Li et al., 2017]. While BOHB is among the best general-purpose HPO methods we are aware of, it still has problems with optimizing discrete dimensions and does not scale as well to high dimensions as one would wish. Therefore, it does not work well on high-dimensional HPO problems with discrete dimensions and also has problems with tabular neural architecture search (NAS) benchmarks (which can be tackled as high-dimensional discrete-valued HPO benchmarks, an approach followed, e.g., by regularized evolution (RE) [Real et al., 2019]).

The main contribution of this paper is to further improve upon BOHB to devise an effective general HPO method, which we dub DEHB. DEHB is based on a combination of the evolutionary optimization method of differential evolution (DE) [Storn and Price, 1997] and Hyperband, and has several useful properties:

1. DEHB fulfills all the desiderata of a good HPO optimizer stated above, and in particular achieves more robust strong final performance than BOHB, especially for high-dimensional and discrete-valued problems.
2. DEHB is conceptually simple and can thus be easily re-implemented in different frameworks.
3. DEHB is computationally cheap, not incurring the overhead typical of most BO methods.
4. DEHB effectively takes advantage of parallel resources.

After discussing related work (Section 2) and background on DE and Hyperband (Section 3), Section 4 describes our new DEHB method in detail. Section 5 then presents comprehensive experiments on artificial toy functions, surrogate benchmarks, Bayesian neural networks, reinforcement learning, and 13 different tabular neural architecture search benchmarks, demonstrating that DEHB is more effective and robust than a wide range of other HPO methods, and in particular up to 1000× faster than random search (Figure 7) and up to 32× faster than BOHB (Figure 11) on HPO problems; on toy functions, these speedup factors even reached 33,440× and 149×, respectively (Figure 6).
2  Related Work

HPO as a black-box optimization problem can be broadly tackled using two families of methods: model-free methods, such as evolutionary algorithms, and model-based Bayesian optimization methods. Evolutionary Algorithms (EAs) are model-free population-based methods which generally include a method of initializing a population; mutation, crossover and selection operations; and a notion of fitness. EAs have been used for black-box optimization in HPO settings since the 1980s [Grefenstette, 1986]. They have also been popular for designing architectures of deep neural networks [Angeline et al., 1994; Xie and Yuille, 2017; Real et al., 2017; Liu et al., 2017]; recently, Regularized Evolution (RE) [Real et al., 2019] achieved state-of-the-art results on ImageNet.

Bayesian optimization (BO) uses a probabilistic model based on the already observed data points to model the objective function and to trade off exploration and exploitation. The most commonly used probabilistic models in BO are Gaussian processes (GPs), since they provide well-calibrated and smooth uncertainty estimates [Snoek et al., 2012]. However, GP-based models have high complexity, do not natively scale well to high dimensions and do not apply to complex spaces without prior knowledge; alternatives include tree-based methods [Bergstra et al., 2011; Hutter et al., 2011] and Bayesian neural networks [Springenberg et al., 2016].

Recent so-called multi-fidelity methods exploit cheap approximations of the objective function to speed up the optimization [Liu et al., 2016; Wang et al., 2017]. Multi-fidelity optimization is also popular in BO, with Fabolas [Klein et al., 2016] and Dragonfly [Kandasamy et al., 2020] being GP-based examples. The popular method BOHB [Falkner et al., 2018], which combines BO and the bandit-based approach Hyperband [Li et al., 2017], has been shown to be a strong off-the-shelf HPO method and, to the best of our knowledge, is the best previous off-the-shelf multi-fidelity optimizer.

3  Background

3.1  Differential Evolution (DE)

In each generation g, DE uses an evolutionary search based on difference vectors to generate new candidate solutions. DE is a population-based EA which uses three basic iterative steps (mutation, crossover and selection). At the beginning of the search on a D-dimensional problem, we initialize a population of N individuals x_{i,g} = (x^1_{i,g}, x^2_{i,g}, ..., x^D_{i,g}) randomly within the search range of the problem being solved. Each individual x_{i,g} is evaluated by computing its corresponding objective function value. Then the mutation operation generates a new offspring for each individual. The canonical DE uses a mutation strategy called rand/1, which selects three random parents x_{r1}, x_{r2}, x_{r3} to generate a new mutant vector v_{i,g} for each x_{i,g} in the population as shown in Eq. 1, where F is a scaling factor parameter that takes a value within the range (0, 1]:

    v_{i,g} = x_{r1,g} + F · (x_{r2,g} − x_{r3,g}).    (1)

The crossover operation then combines each individual x_{i,g} and its corresponding mutant vector v_{i,g} to generate the final offspring/child u_{i,g}. The canonical DE uses a simple binomial crossover that selects values from v_{i,g} with a probability p (called the crossover rate) and from x_{i,g} otherwise. For the members x_{i,g+1} of the next generation, DE then uses the better of x_{i,g} and u_{i,g}. More details on DE can be found in Appendix A.
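To make these three operations concrete, below is a minimal Python sketch of one DE generation, following Eq. 1, the binomial crossover and the greedy selection just described (see also Appendix A). The function name de_generation and its exact signature are illustrative and not part of the DEHB reference implementation.

    import numpy as np

    def de_generation(pop, fitness, f_obj, F=0.5, p=0.5, rng=None):
        """One generation of canonical DE: rand/1 mutation, binomial crossover,
        greedy selection (sketch of Section 3.1 / Appendix A; DE minimizes f_obj)."""
        rng = np.random.default_rng() if rng is None else rng
        N, D = pop.shape
        new_pop, new_fit = pop.copy(), fitness.copy()
        for i in range(N):
            # rand/1 mutation (Eq. 1): three distinct parents, all different from i
            r1, r2, r3 = rng.choice([j for j in range(N) if j != i], size=3, replace=False)
            v = pop[r1] + F * (pop[r2] - pop[r3])
            # values outside the unit search range are re-drawn uniformly
            # (cf. footnote 4 in Appendix A)
            out = (v < 0.0) | (v > 1.0)
            v[out] = rng.random(out.sum())
            # binomial crossover: take the mutant value with probability p,
            # with one dimension (j_rand) always taken from the mutant
            cross = rng.random(D) <= p
            cross[rng.integers(D)] = True
            u = np.where(cross, v, pop[i])
            # greedy selection: keep the better of target and trial
            f_u = f_obj(u)
            if f_u <= fitness[i]:
                new_pop[i], new_fit[i] = u, f_u
        return new_pop, new_fit

The defaults F = 0.5 and p = 0.5 mirror the settings used for DE and DEHB in Section 5.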
3.2  Successive Halving (SH) and Hyperband (HB)

Successive Halving (SH) [Jamieson and Talwalkar, 2016] is a simple yet effective multi-fidelity optimization method that exploits the fact that, for many problems, low-cost approximations of the expensive blackbox function exist, which can be used to rule out poor parts of the search space at little computational cost. Higher-cost approximations are only used for a small fraction of the configurations to be evaluated. Specifically, an iteration of SH starts by sampling N configurations uniformly at random, evaluating them at the lowest-cost approximation (the so-called lowest budget), and forwarding the top 1/η fraction of them to the next budget (function evaluations at which are expected to be roughly η times more expensive). This process is repeated until the highest budget, used by the expensive original blackbox function, is reached. Once the runs on the highest budget are complete, the current SH iteration ends, and the next iteration starts with the lowest budget. We call each such fixed sequence of evaluations from lowest to highest budget a SH bracket. While SH is often very effective, it is not guaranteed to converge to the optimal configuration even with infinite resources, because it can drop poorly-performing configurations at low budgets that actually might be the best at the highest budget.

Hyperband (HB) [Li et al., 2017] solves this problem by hedging its bets across different instantiations of SH with successively larger lowest budgets, thereby being provably at most a constant factor slower than random search. In particular, this procedure also allows finding configurations that are strong at higher budgets but would have been eliminated at lower budgets. Algorithm 2 in Appendix B shows the pseudocode for HB with the SH subroutine. One iteration of HB (also called an HB bracket) can be viewed as a sequence of SH brackets with different starting budgets and different numbers of configurations for each SH bracket. The precise budgets and number of configurations per budget are determined by HB given its three parameters: minimum budget, maximum budget, and η.

The main advantages of HB are its simplicity, theoretical guarantees, and strong anytime performance compared to optimization methods operating on the full budget. However, HB can perform worse than BO and DE for longer runs, since it only selects configurations based on random sampling and does not learn from previously sampled configurations.
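As a concrete reading of this schedule, the following sketch computes the rungs (number of configurations and budget) of the SH bracket that HB runs at a given iteration, mirroring Algorithm 2 in Appendix B; budgets are returned here in absolute units, and the function name sh_bracket_schedule is our own.

    import math

    def sh_bracket_schedule(b_min, b_max, eta, iteration):
        """Rungs (n_configs, budget) of the SH bracket run at a given HB iteration
        (sketch after Algorithm 2 in Appendix B)."""
        # s_max = floor(log_eta(b_max / b_min)), computed without floating-point surprises
        s_max = 0
        while b_min * eta ** (s_max + 1) <= b_max:
            s_max += 1
        s = s_max - (iteration % (s_max + 1))             # which SH bracket of the HB iteration
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))   # configurations at the lowest rung
        return [(n // eta ** i, b_max / eta ** (s - i)) for i in range(s + 1)]

    # For the settings of Figure 1 (b_min=1, b_max=27, eta=3), the first SH bracket
    # evaluates 27 configurations at budget 1, the top 9 at budget 3, 3 at budget 9
    # and 1 at budget 27.
    print(sh_bracket_schedule(1, 27, 3, iteration=0))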
4  DEHB

We design DEHB to satisfy all the desiderata described in the introduction (Section 1). DEHB inherits several advantages from HB to satisfy some of these desiderata, including its strong anytime performance, scalability and flexibility. From the DE component, it inherits robustness, simplicity, and computational efficiency. We explain DEHB in detail in the remainder of this section; full pseudocode can be found in Algorithm 3 in Appendix C.
Figure 1: Internals of a DEHB iteration showing information flow across fidelities (top-down), and how each subpopulation is updated in each DEHB iteration (left-right).

Figure 2: Modified SH routine under DEHB.

4.1  High-Level Overview

A key design principle of DEHB is to share information across the runs it executes at various budgets. DEHB maintains a subpopulation for each of the budget levels, where the population size for each subpopulation is assigned as the maximum number of function evaluations HB allocates for the corresponding budget.

We borrow nomenclature from HB and call the HB iterations that DEHB uses DEHB iterations. Figure 1 illustrates one such iteration, where minimum budget, maximum budget, and η are 1, 27, and 3, respectively. The topmost sphere for SH Bracket 1 is the first step, where 27 configurations are sampled uniformly at random and evaluated at the lowest budget 1. These evaluated configurations now form the DE subpopulation associated with budget 1. The dotted arrow pointing downwards indicates that the top-9 configurations (27/η) are promoted to be evaluated on the next higher budget 3 to create the DE subpopulation associated with budget 3, and so on until the highest budget. This progressive increase of the budget by η and decrease of the number of configurations evaluated by η is simply vanilla SH. Indeed, each SH bracket of this first DEHB iteration is basically executing vanilla SH, starting from different minimum budgets, just like in HB.

One difference from vanilla SH is that random sampling of configurations occurs only once: in the first step of the first SH bracket of the first DEHB iteration. Every subsequent SH bracket begins by reusing the subpopulation updated in the previous SH bracket, and carrying out a DE evolution (detailed in Section 4.2). For example, for SH bracket 2 in Figure 1, the subpopulation of 9 configurations for budget 3 (topmost sphere) is propagated from SH bracket 1 and undergoes evolution. The top 3 configurations (9/η) then affect the population for the next higher budget 9 of SH bracket 2. Specifically, these will be used as the so-called parent pool for that higher budget, using the modified DE evolution to be discussed in Section 4.2. The end of SH Bracket 4 marks the end of this DEHB iteration. We dub DEHB's first iteration its initialization iteration. At the end of this iteration, all DE subpopulations associated with the higher budgets are seeded with configurations that performed well at the lower budgets. In subsequent SH brackets, no random sampling occurs anymore, and the search runs separate DE evolutions at different budget levels, where information flows from the subpopulations at lower budgets to those at higher budgets through the modified DE mutation (Fig. 3).
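The subpopulation sizes follow directly from the HB schedule; the sketch below is one straightforward implementation of the rule stated above (population size per budget = maximum number of function evaluations any SH bracket of the iteration performs at that budget), with helper names of our own choosing.

    import math

    def subpopulation_sizes(b_min, b_max, eta):
        """Population size of the DE subpopulation kept at each budget level,
        read off from the SH brackets of one HB/DEHB iteration (cf. Section 4.1)."""
        s_max = 0
        while b_min * eta ** (s_max + 1) <= b_max:
            s_max += 1
        sizes = {}
        for s in range(s_max, -1, -1):                            # one SH bracket per value of s
            n = math.ceil((s_max + 1) * eta ** s / (s + 1))       # configs at the lowest rung
            for i in range(s + 1):
                budget = b_min * eta ** (s_max - s + i)
                sizes[budget] = max(sizes.get(budget, 0), n // eta ** i)
        return sizes

    # e.g. subpopulation_sizes(b_min=1, b_max=27, eta=3) assigns the largest
    # subpopulation (27 individuals) to the lowest budget and smaller ones above it.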
4.2  Modified Successive Halving using DE Evolution

We now discuss the deviations from vanilla SH by elaborating on the design of a SH bracket inside DEHB, highlighted with a box in Figure 1 (SH Bracket 1). In DEHB, the top-performing configurations from a lower budget are not simply promoted and evaluated on a higher budget (except for the Initialization SH bracket). Rather, in DEHB, the top-performing configurations are collected in a Parent Pool (Figure 2). This pool is responsible for the transfer of information from a lower budget to the next higher budget, but not by directly suggesting the best configurations from the lower budget for re-evaluation at a higher budget. Instead, the parent pool represents a good-performing region w.r.t. the lower budget, from which parents can be sampled for mutation. Figure 3b demonstrates how a parent pool contributes to a DE evolution in DEHB. Unlike in vanilla DE (Figure 3a), in DEHB the mutants involved in DE evolution are extracted from the parent pool instead of the population itself. This allows the evolution to incorporate and combine information from the current budget, and also from the decoupled search happening on the lower budget.
Figure 3: Modified DE evolution under DEHB.

The selection step shown in Figure 3 is responsible for updating the current subpopulation if the newly suggested configuration is better. If not, the existing configuration is retained in the subpopulation. This guards against cases where performance across budget levels is not correlated and good configurations from lower budgets do not improve higher-budget scores. However, search on the higher budget can still progress, as the first step of every SH bracket performs vanilla DE evolution (there is no parent pool to receive information from). Thereby, search at the required budget level progresses even if lower budgets are not informative.

Additionally, we also construct a global population pool consisting of configurations from all the subpopulations. This pool does not undergo any evolution and serves as the parent pool in the edge case where the parent pool is smaller than the minimum number of individuals required for the mutation step. For the example in Figure 2, under the rand/1 mutation strategy (which requires three parents), we see that for the highest budget, only one configuration (3/η) is included from the previous budget. In such a scenario, the two additional required parents are sampled from the global population pool.
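A minimal sketch of the modified mutation just described: the three rand/1 parents are drawn from the parent pool of the lower budget rather than from the subpopulation being evolved, and the global population pool tops up the parent pool whenever it holds fewer than the three required individuals. The name dehb_mutation is ours, and the sketch leaves out the bookkeeping of the full Algorithm 3.

    import numpy as np

    def dehb_mutation(parent_pool, global_pool, F=0.5, rng=None):
        """rand/1 mutation for DEHB (Section 4.2): parents come from the parent pool,
        with the global population pool as a fallback when the pool is too small."""
        rng = np.random.default_rng() if rng is None else rng
        pool = list(parent_pool)
        missing = 3 - len(pool)
        if missing > 0:
            # edge case of Section 4.2: sample the remaining parents from the global pool
            extra = rng.choice(len(global_pool), size=missing, replace=False)
            pool += [global_pool[k] for k in extra]
        r1, r2, r3 = rng.choice(len(pool), size=3, replace=False)
        return pool[r1] + F * (pool[r2] - pool[r3])      # Eq. 1 applied to pooled parents

The trial configuration is then formed by the usual binomial crossover against a target taken from the subpopulation of the current budget, and the selection step of Figure 3 keeps the better of the two.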
4.3  DEHB efficiency and parallelization

As mentioned previously, DEHB carries out separate DE searches at each budget level. Moreover, the DE operations involved in evolving a configuration are constant in operation and time. Therefore, DEHB's runtime overhead does not grow over time, even as the number of performed function evaluations increases; this is in stark contrast to model-based methods, whose time complexity is often cubic in the number of performed function evaluations. Indeed, Figure 4 demonstrates that, for a tabular benchmark with negligible cost for function evaluations, DEHB is almost two orders of magnitude faster than BOHB at performing 13,336 function evaluations. GP-based Bayesian optimization tools would require approximations to even fit a single model with this number of function evaluations.

Figure 4: Runtime comparison for DEHB and BOHB based on a single run on the Cifar-10 benchmark from NAS-Bench-201. The x-axis shows the actual cumulative wall-clock time spent by the algorithm (optimization time) in between the function evaluations.

We also briefly describe a parallel version of DEHB (see Appendix C.3 for details of its design). Since DEHB can be viewed as a sequence of predetermined SH brackets, the SH brackets can be asynchronously distributed over free workers. A central DEHB Orchestrator keeps a single copy of all DE subpopulations, allowing for asynchronous, immediate DE evolution updates. Figure 5 illustrates that this parallel version achieves linear speedups for similar final performance.

Figure 5: Results for the OpenML Letter surrogate benchmark, where n represents the number of workers used for each DEHB run. Each trace is averaged over 10 runs.
5  Experiments

We now comprehensively evaluate DEHB, illustrating that it is more robust and efficient than any other HPO method we are aware of. To keep comparisons fair and reproducible, we use a broad collection of publicly-available HPO and NAS benchmarks: all HPO benchmarks that were used to demonstrate the strength of BOHB [Falkner et al., 2018]^1 and also a broad collection of 13 recent tabular NAS benchmarks represented as HPO problems [Awad et al., 2020].

In this section, to avoid cluttered plots, we present a focused comparison of DEHB with BOHB, the best previous off-the-shelf multi-fidelity HPO method we are aware of, which has in turn outperformed a broad range of competitors (GP-BO, TPE, SMAC, HB, Fabolas, MTBO, and HB-LCNet) on these benchmarks [Falkner et al., 2018]. For reference, we also include the obligatory random search (RS) baseline in these plots, showing it to be clearly dominated, with up to 1000-fold speedups. We also provide a comparison against a broader range of methods at the end of this section (see Figure 13 and Table 1), with a full comparison in Appendix D. We also compare to the recent GP-based multi-fidelity BO tool Dragonfly in Appendix D.7. Details for the hyperparameter values of the used algorithms can be found in Appendix D.1.

We use the same parameter settings of mutation factor F = 0.5 and crossover rate p = 0.5 for both DE and DEHB. The population size for DEHB is not user-defined but set by its internal Hyperband component, while we set it to 20 for DE, following [Awad et al., 2020]. Unless specified otherwise, we report results from 50 runs for all algorithms, plotting the validation regret^2 over the cumulative cost incurred by the function evaluations, and ignoring the optimizers' overhead in order to not give DEHB what could be seen as an unfair advantage.^3 We also show the speedups that DEHB achieves compared to RS and BOHB, where this is possible without adding clutter.

^1 We leave out the 2-dimensional SVM surrogate benchmarks, since all multi-fidelity algorithms performed similarly for this easy task, without any discernible difference.
^2 This is the difference of the validation score from the global best.
^3 Shaded bands in plots represent the standard error of the mean.

5.1  Artificial Toy Function: Stochastic Counting Ones

This toy benchmark by Falkner et al. [2018] is useful to assess scaling behavior and the ability to handle binary dimensions. The goal is to minimize the following objective function:

    f(x) = − ( Σ_{x_i ∈ X_cat} x_i  +  Σ_{x_j ∈ X_cont} E_b[ B_{p = x_j} ] ),

where the sum over the categorical variables (x_i ∈ {0, 1}) represents the standard discrete counting ones problem. The continuous variables (x_j ∈ [0, 1]) represent the stochastic component, with the budget b controlling the noise. The budget here represents the number of samples used to estimate the mean of the Bernoulli distribution (B) with parameter x_j.
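Read literally, the objective adds an exact count over the binary variables to a Monte-Carlo estimate for the continuous ones; a small transcription of this benchmark (our own sketch, with budget playing the role of b) is:

    import numpy as np

    def stochastic_counting_ones(x_cat, x_cont, budget, rng=None):
        """Noisy counting-ones objective of Section 5.1 (to be minimized): the categorical
        part is summed exactly, each continuous x_j contributes the empirical mean of
        budget Bernoulli(p = x_j) samples, so larger budgets give less noisy values."""
        rng = np.random.default_rng() if rng is None else rng
        cat_term = float(np.sum(x_cat))                                    # x_i in {0, 1}
        cont_term = sum(rng.binomial(1, p, size=budget).mean() for p in x_cont)
        return -(cat_term + cont_term)                                     # optimum near -(N_cat + N_cont)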
Following Falkner et al. [2018], we run 4 sets of experiments with N_cont = N_cat = {4, 8, 16, 32}, where N_cont = |X_cont| and N_cat = |X_cat|, using the same budget spacing and plotting the normalized regret (f(x) + d)/d, where d = N_cat + N_cont. Although this is a toy benchmark, it can offer interesting insights, since the search space has mixed binary/continuous dimensions which DEHB handles well (refer to C.2 in the Appendix for more details). In Figure 6, we consider the 64-dimensional space N_cat = N_cont = 32; results for the lower dimensions can be found in Appendix D.2. Both BOHB and DEHB begin with a set of randomly sampled individuals evaluated on the lowest budget. It is therefore unsurprising that in Figure 6 (and in other experiments too), these two algorithms follow a similar optimization trace at the beginning of the search. Given the high dimensionality, BOHB requires many more samples to switch to model-based search, which slows its convergence in comparison to the lower-dimensional cases (N_cont = N_cat = {4, 8, 16}). In contrast, DEHB's convergence rate is almost agnostic to the increase in dimensionality.

Figure 6: Results for the Stochastic Counting Ones problem in 64-dimensional space with 32 categorical and 32 continuous hyperparameters. All algorithms shown were run for 50 runs.

5.2  Surrogates for Feedforward Neural Networks

In this experiment, we optimize six architectural and training hyperparameters of a feed-forward neural network on six different datasets from OpenML [Vanschoren et al., 2014], using a surrogate benchmark built by Falkner et al. [2018]. The budgets are the training epochs for the neural networks. For all six datasets, we observe a similar pattern of the search trajectory, with DEHB and BOHB having similar anytime performance and DEHB achieving the best final score. An example is given in Figure 7, also showing a 1000-fold speedup over random search; qualitatively similar results for the other 5 datasets are in Appendix D.3.

Figure 7: Results for the OpenML Adult surrogate benchmark for 6 continuous hyperparameters, for 50 runs of each algorithm.

5.3  Bayesian Neural Networks

In this benchmark, introduced by Falkner et al. [2018], a two-layer fully-connected Bayesian neural network is trained using stochastic gradient Hamiltonian Monte-Carlo sampling (SGHMC) [Chen et al., 2014] with scale adaptation [Springenberg et al., 2016]. The budgets were the number of MCMC steps (500 as minimum; 10000 as maximum). Two regression datasets from UCI [Dua and Graff, 2017] were used for the experiments: Boston Housing and Protein Structure. Figure 8 shows the results for Boston Housing; the results for Protein Structure are in Appendix D.4. For this extremely noisy benchmark, BOHB and DEHB perform similarly, and both are about 2× faster than RS.

Figure 8: Results for tuning 5 hyperparameters of a Bayesian neural network on the Boston Housing regression dataset, for 50 runs each.

5.4  Reinforcement Learning

For this benchmark, used by Falkner et al. [2018], a proximal policy optimization (PPO) [Schulman et al., 2017] implementation is parameterized with 7 hyperparameters. PPO is used to learn the cartpole swing-up task from the OpenAI Gym [Brockman et al., 2016] environment. We plot the mean number of episodes needed until convergence for a configuration over actual cumulative wall-clock time in Figure 9. Despite the strong noise in this problem, BOHB and DEHB are able to improve continuously, showing similar performance, and speeding up over random search by roughly 2×.

Figure 9: Results for tuning PPO on the OpenAI Gym cartpole environment with 7 hyperparameters. Each algorithm was run for 50 runs.

5.5  NAS Benchmarks

In this series of experiments, we evaluate DEHB on a broad range of NAS benchmarks. We use a total of 13 tabular benchmarks from NAS-Bench-101 [Ying et al., 2019], NAS-Bench-1shot1 [Zela et al., 2020], NAS-Bench-201 [Dong and Yang, 2020] and NAS-HPO-Bench [Klein and Hutter, 2019].
Figure 10: Results for CifarC from NAS-Bench-101 for a 27-dimensional space (22 continuous + 5 categorical hyperparameters).

Figure 11: Results for ImageNet16-120 from NAS-Bench-201 for 50 runs of each algorithm. The search space contains 6 categorical parameters.

For NAS-Bench-101, we show results on CifarC (a mixed data type encoding of the parameter space [Awad et al., 2020]) in Figure 10; BOHB and DEHB initially perform similarly to RS for this dataset, since there is only little correlation between runs with few epochs (low budgets) and many epochs (high budgets) in NAS-Bench-101. In the end, RS stagnates, BOHB stagnates at a slightly better performance, and DEHB continues to improve. In Figure 11, we report results for ImageNet16-120 from NAS-Bench-201. In this case, DEHB is clearly the best of the methods, quickly converging to a strong solution.

Finally, Figure 12 reports results for the Protein Structure dataset provided in NAS-HPO-Bench. DEHB makes progress faster than BOHB to reach the optimum. The results on the other NAS benchmarks are qualitatively similar to these 3 representative benchmarks, and are given in Appendix D.6.

Figure 12: Results for the Protein Structure dataset from NAS-HPO-Bench for 50 runs of each algorithm. The search space contains 9 hyperparameters.
5.6  Results summary

We now compare DEHB to a broader range of baseline algorithms, also including HB, TPE [Bergstra et al., 2011], SMAC [Hutter et al., 2011], regularized evolution (RE) [Real et al., 2019], and DE. Based on the mean validation regret, all algorithms can be ranked for each benchmark, for every second of the estimated wallclock time. Arranging the mean regret per timepoint across all benchmarks (except the Stochastic Counting Ones and the Bayesian Neural Network benchmarks, which do not have runtimes as budgets), we compute the average relative rank over time for each algorithm in Figure 13, where all 8 algorithms were given the mean rank of 4.5 at the beginning. The shaded region clearly indicates that DEHB is the most robust algorithm for this set of benchmarks (discussed further in Appendix D.8). In the end, RE and DE are similarly good, but these blackbox optimization algorithms perform worst for small compute budgets, while DEHB's multi-fidelity aspect makes it robust across compute budgets.

Figure 13: Average rank of the mean validation regret of 50 runs of each algorithm, averaged over the NAS-Bench-101, NAS-Bench-1shot1, NAS-HPO-Bench, NAS-Bench-201, OpenML surrogates, and the Reinforcement Learning benchmarks.

In Table 1, we show the average rank of each algorithm based on the final validation regret achieved across all benchmarks (now also including Stochastic Counting Ones and Bayesian Neural Networks; data derived from Table 2 in Appendix D.8). Next to its strong anytime performance, DEHB also yields the best final performance in this comparison, thus emerging as a strong general optimizer that works consistently across a diverse set of benchmarks. Result tables and figures for all benchmarks can be found in Appendix D.

              RS     HB     BOHB   TPE    SMAC   RE     DE     DEHB
  Avg. rank   7.46   6.54   4.42   4.35   4.73   3.16   2.96   2.39

Table 1: Mean ranks based on the final mean validation regret for all algorithms tested, across all benchmarks.
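For reference, the ranking statistic behind Figure 13 and Table 1 can be computed from a table of mean validation regrets as sketched below (one row per benchmark and time point, one column per optimizer); this is our own illustration of the procedure described above, using scipy's standard average-rank convention for ties.

    import numpy as np
    from scipy.stats import rankdata

    def average_ranks(mean_regret):
        """mean_regret: array of shape (n_cases, n_algorithms), each row holding every
        algorithm's mean validation regret for one benchmark at one time point (or its
        final regret). Lower regret gives a lower (better) rank; ties share the average
        rank, so e.g. 8 algorithms that are all tied receive rank 4.5 each, as at the
        start of Figure 13. Returns one averaged rank per algorithm."""
        ranks = np.vstack([rankdata(row) for row in np.asarray(mean_regret, dtype=float)])
        return ranks.mean(axis=0)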
6  Conclusion

We introduced DEHB, a new, general HPO solver, built to perform efficiently and robustly across many different problem domains. As discussed, DEHB satisfies the many requirements of such an HPO solver: strong performance with both short and long compute budgets, robust results, scalability to high dimensions, flexibility to handle mixed data types, parallelizability, and low computational overhead. Our experiments show that DEHB meets these requirements and in particular yields much more robust performance for discrete and high-dimensional problems than BOHB, the previous best overall HPO method we are aware of. Indeed, in our experiments, DEHB was up to 32× faster than BOHB and up to 1000× faster than random search. DEHB does not require advanced software packages, is simple by design, and can easily be implemented across various platforms and languages, allowing for practical adoption. We thus hope that DEHB will become a new default HPO method. Our reference implementation of DEHB is available at https://github.com/automl/DEHB.
Acknowledgements. The authors acknowledge funding by the Robert Bosch GmbH, by the German Federal Ministry of Education and Research (BMBF, grant RenormalizedFlows 01IS19077C), and support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no. INST 39/963-1 FUGG.

References

P. J. Angeline, G. M. Saunders, and J. B. Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks, 5(1):54–65, 1994.

N. Awad, N. Mallik, and F. Hutter. Differential evolution for neural architecture search. In First ICLR Workshop on Neural Architecture Search, 2020.

J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Proc. of NeurIPS'11, pages 2546–2554, 2011.

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

U. K. Chakraborty. Advances in Differential Evolution, volume 143. Springer, 2008.

T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.

S. Das, S. S. Mullick, and P. N. Suganthan. Recent advances in differential evolution – an updated survey. Swarm and Evolutionary Computation, 27:1–30, 2016.

X. Dong and Y. Yang. NAS-Bench-102: Extending the scope of reproducible neural architecture search. arXiv preprint arXiv:2001.00326, 2020.

D. Dua and C. Graff. UCI machine learning repository, 2017.

S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proc. of ICML'18, pages 1437–1446, 2018.

J. J. Grefenstette. Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16:341–359, 1986.

P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Proc. of AAAI'18, 2018.

F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proc. of LION'11, pages 507–523, 2011.

K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Proc. of AISTATS'16, 2016.

K. Price, R. M. Storn, and J. A. Lampinen. Differential Evolution: A Practical Approach to Global Optimization. Springer Science & Business Media, 2006.

K. Kandasamy, G. Dasarathy, J. Schneider, and B. Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. arXiv:1703.06240 [stat.ML], 2017.

K. Kandasamy, K. R. Vysyaraju, W. Neiswanger, B. Paria, C. R. Collins, J. Schneider, B. Poczos, and E. P. Xing. Tuning hyperparameters without grad students: Scalable and robust Bayesian optimisation with Dragonfly. Journal of Machine Learning Research, 21(81):1–27, 2020.

A. Klein and F. Hutter. Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv preprint arXiv:1905.04970, 2019.

A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. arXiv:1605.07079 [cs.LG], 2016.

L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In Proc. of ICLR'17, 2017.

B. Liu, S. Koziel, and Q. Zhang. A multi-fidelity surrogate-model-assisted evolutionary algorithm for computationally expensive optimization problems. Journal of Computational Science, 12:28–37, 2016.

H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.

H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

G. Melis, C. Dyer, and P. Blunsom. On the state of the art of evaluation in neural language models. In Proc. of ICLR'18, 2018.

H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.

E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. Large-scale evolution of image classifiers. In Proc. of ICML, pages 2902–2911. JMLR.org, 2017.

E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In Proc. of AAAI, volume 33, pages 4780–4789, 2019.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proc. of NeurIPS'12, pages 2951–2959, 2012.

J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In Proc. of NeurIPS, pages 4134–4142, 2016.

R. Storn and K. Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341–359, 1997.

M. Vallati, F. Hutter, L. Chrpa, and T. L. McCluskey. On the effective configuration of planning domain models. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.

H. Wang, Y. Jin, and J. Doherty. A generic test suite for evolutionary multifidelity optimization. IEEE Transactions on Evolutionary Computation, 22(6):836–850, 2017.

L. Xie and A. Yuille. Genetic CNN. In Proc. of ICCV, pages 1379–1388, 2017.

C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter. NAS-Bench-101: Towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635, 2019.

A. Zela, J. Siems, and F. Hutter. NAS-Bench-1shot1: Benchmarking and dissecting one-shot neural architecture search. arXiv preprint arXiv:2001.10422, 2020.
A  More details on DE

Differential Evolution (DE) is a simple, well-performing evolutionary algorithm for solving a variety of optimization problems [Price et al., 2006; Das et al., 2016]. The algorithm was originally introduced in 1995 by Storn and Price [Storn and Price, 1997], and later attracted the attention of many researchers, who proposed new, improved state-of-the-art algorithms [Chakraborty, 2008]. DE is based on four steps: initialization, mutation, crossover and selection. Algorithm 1 presents the DE pseudo-code.

Initialization. DE is a population-based meta-heuristic algorithm which consists of a population of N individuals. Each individual is considered a solution and expressed as a vector of D-dimensional decision variables as follows:

    pop_g = (x^1_{i,g}, x^2_{i,g}, ..., x^D_{i,g}),  i = 1, 2, ..., N,    (2)

where g is the generation number, D is the dimension of the problem being solved and N is the population size. The algorithm starts initially with randomly distributed individuals within the search space. The function value for the problem being solved, f(x), is then computed for each individual.

Mutation. A new child/offspring is generated using the mutation operation for each individual in the population by a so-called mutation strategy. Figure 14 illustrates this operation for a 2-dimensional case. The classical DE uses the mutation operator rand/1, in which three random individuals/parents denoted as x_{r1}, x_{r2}, x_{r3} are chosen to generate a new vector v_i as follows:

    v_{i,g} = x_{r1,g} + F · (x_{r2,g} − x_{r3,g}),    (3)

where v_{i,g} is the mutant vector generated for each individual x_{i,g} in the population, F is the scaling factor that usually takes values within the range (0, 1], and r1, r2, r3 are the indices of different randomly-selected individuals. Eq. 3 allows some parameters to be outside the search range; therefore, each parameter in v_{i,g} is checked and reset^4 if it happens to be outside the boundaries.

Figure 14: Illustration of the DE mutation operation for a 2-dimensional case using the rand/1 mutation strategy. The scaled difference vector F · (x_{r2} − x_{r3}) is used to determine the neighbourhood of search around x_{r1}. Depending on the diversity of the population, DE mutation's search will be explorative or exploitative.

Crossover. When the mutation phase is completed, the crossover operation is applied to each target vector x_{i,g} and its corresponding mutant vector v_{i,g} to generate a trial vector u_{i,g}. Classical DE uses the following uniform (binomial) crossover:

    u^j_{i,g} = v^j_{i,g}  if (rand ≤ p) or (j = j_rand),  and  u^j_{i,g} = x^j_{i,g}  otherwise.    (4)

The crossover rate p is real-valued and is usually specified in the range [0, 1]. This variable controls the portion of parameter values that are copied from the mutant vector. The j-th parameter value is copied from the mutant vector v_{i,g} to the corresponding position in the trial vector u_{i,g} if a random number is less than or equal to p. If the condition is not satisfied, then the j-th position is copied from the target vector x_{i,g}. j_rand is a random integer in the range [1, D] that ensures at least one dimension is copied from the mutant, in case the random number generated for all dimensions is > p. Figure 15 shows an illustration of the crossover operation.

Figure 15: Illustration of the DE crossover operation for a 2-dimensional case using the binomial crossover. The vertices of the rectangle show the possible solutions between a parent x and a mutant v. Based on the choice of p, the resulting individual will either be a copy of the parent, a copy of the mutant, or incorporate components from both parent and mutant.

Selection. After the final offspring is generated, the selection operation takes place to determine whether the target (the parent, x_{i,g}) or the trial (the offspring, u_{i,g}) vector survives to the next generation, by comparing the function values. The offspring replaces its parent if it has a better^5 function value, as shown in Equation 5. Otherwise, the new offspring is discarded, and the target vector remains in the population for the next generation.

    x_{i,g+1} = u_{i,g}  if f(u_{i,g}) ≤ f(x_{i,g}),  and  x_{i,g+1} = x_{i,g}  otherwise.    (5)

^4 A random value from [0, 1] is chosen uniformly in this work.
^5 DE is a minimizer.

Algorithm 1: DE Optimizer
  Input:
    f - black-box problem
    F - scaling factor (default F = 0.5)
    p - crossover rate (default p = 0.5)
    N - population size
  Output: best found individual in pop
  1  g = 0, FE = 0
  2  pop_g ← initial_population(N, D)
  3  fitness_g ← evaluate_population(pop_g)
  4  FE = N
  5  while (FE < FE_max) do
  6      mutate(pop_g)
  7      offspring_g ← crossover(pop_g)
  8      fitness_g ← evaluate_population(offspring_g)
  9      pop_{g+1} ← select(pop_g, offspring_g)   (greedy selection, Eq. 5)
 10      FE = FE + N;  g = g + 1
 11  end
 12  return best individual in pop
[Figure 15: Illustration of DE crossover operation for a 2-dimensional case using the binomial crossover. The vertices of the rectangle show the possible solutions between a parent x and mutant v: based on the choice of p, the resultant individual U will either be a copy of the parent, or of the mutant (U = V), or incorporate components from both, e.g., U = (V1, X2) or U = (X1, V2).]

Algorithm 1: DE Optimizer
  Input:
    f - black-box problem
    F - scaling factor (default F = 0.5)
    p - crossover rate (default p = 0.5)
    N - population size
  Output: Return best found individual in pop
   1  g = 0, FE = 0;
   2  pop_g ← initial_population(N, D);
   3  fitness_g ← evaluate_population(pop_g);
   4  FE = N;
   5  while (FE < FE_max) do
   6      mutate(pop_g);
   7      offspring_g ← crossover(pop_g);
   8      fitness_g ← evaluate_population(offspring_g);
   9      pop_{g+1}, fitness_{g+1} ← select(pop_g, offspring_g);
  10      FE = FE + N;
  11      g = g + 1;
  12  end
  13  return Individual with highest fitness seen
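As a concrete companion to Algorithm 1 and Equation 4, here is a minimal Python sketch of classical DE with rand/1 mutation, binomial crossover and deferred (generation-wise) selection. It is an illustration written for this article, not code from the DEHB package; the names de_optimize and sphere and all defaults are our own.

import numpy as np

def de_optimize(f, dim, pop_size=20, F=0.5, p=0.5, max_evals=2000, seed=0):
    """Classical DE (rand/1/bin) minimizing f over [0, 1]^dim with deferred,
    generation-wise updates as in Algorithm 1."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(size=(pop_size, dim))
    fitness = np.array([f(x) for x in pop])
    evals = pop_size
    while evals < max_evals:
        offspring = np.empty_like(pop)
        for i in range(pop_size):
            # rand/1 mutation: v = x_r1 + F * (x_r2 - x_r3), with distinct r1, r2, r3 != i
            r1, r2, r3 = rng.choice([j for j in range(pop_size) if j != i], size=3, replace=False)
            mutant = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), 0.0, 1.0)
            # binomial crossover (Equation 4)
            mask = rng.uniform(size=dim) <= p
            mask[rng.integers(dim)] = True       # j_rand guarantees one mutant component
            offspring[i] = np.where(mask, mutant, pop[i])
        # deferred selection: the whole generation is evaluated before the population is updated
        off_fitness = np.array([f(x) for x in offspring])
        evals += pop_size
        better = off_fitness <= fitness
        pop[better], fitness[better] = offspring[better], off_fitness[better]
    best = int(np.argmin(fitness))
    return pop[best], fitness[best]

if __name__ == "__main__":
    sphere = lambda x: float(np.sum((x - 0.25) ** 2))   # toy objective for demonstration
    print(de_optimize(sphere, dim=5))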
For DEHB we replace the random sampling with DE search. However, DEHB uses HB at its core to solve the "n versus B/n" tradeoff that HB was designed to address. Algorithm 2 shows how DEHB interfaces HB to query the sequence of how many configurations of each budget to run at each iteration. This view treats the DEHB algorithm as a sequence of predetermined (by HB), repeating Successive Halving (SH) brackets, where the iteration number refers to the index of the SH bracket run by DEHB.

Algorithm 2: A SH bracket under Hyperband
  Input:
    bmin, bmax - min and max budgets
    η - halving rate; the top 1/η configurations are promoted
    iteration - iteration number
  Output: List of no. of configurations and budgets
   1  smax = ⌊log_η(bmax / bmin)⌋
   2  s = smax − (iteration mod (smax + 1))
   3  N = ⌈((smax + 1) / (s + 1)) · η^s⌉
   4  b0 = (bmax / bmin) · η^(−s)
   5  budgets = n_configs = []
   6  for i ∈ {0, ..., s} do
   7      Ni = ⌊N · η^(−i)⌋
   8      b = b0 · η^i
   9      n_configs.append(Ni)
  10      budgets.append(b)
  11  end
  12  return n_configs, budgets
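The bracket geometry returned by Algorithm 2 is easy to reproduce. The sketch below is our own illustration (sh_bracket is our name, not DEHB's API); it follows the common Hyperband convention in which the lowest budget of bracket s is bmax · η^(−s), i.e., Algorithm 2's b0 expressed in absolute budget units rather than as a multiple of bmin.

import math

def sh_bracket(b_min, b_max, eta, iteration):
    """Return (n_configs, budgets) for the SH bracket selected by `iteration`."""
    s_max = math.floor(math.log(b_max / b_min, eta) + 1e-10)  # epsilon guards float error
    s = s_max - (iteration % (s_max + 1))                     # brackets repeat cyclically
    n = math.ceil((s_max + 1) / (s + 1) * eta ** s)           # configs at the lowest rung
    b0 = b_max / eta ** s                                     # lowest budget of this bracket
    n_configs, budgets = [], []
    for i in range(s + 1):
        n_configs.append(n // eta ** i)                       # only ~1/eta survive each rung
        budgets.append(b0 * eta ** i)                         # budgets grow by a factor of eta
    return n_configs, budgets

# Example: b_min=3, b_max=81, eta=3 gives four bracket types that repeat cyclically
for it in range(4):
    print(it, sh_bracket(3, 81, 3, it))
# 0 ([27, 9, 3, 1], [3.0, 9.0, 27.0, 81.0])
# 1 ([12, 4, 1], [9.0, 27.0, 81.0])
# 2 ([6, 2], [27.0, 81.0])
# 3 ([4], [81.0])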
C More details on DEHB

C.1 DEHB algorithm

Algorithm 3 gives the pseudocode describing DEHB. DEHB takes as input the parameters for HB (bmin, bmax, η) and the parameters for DE (F, p). For the experiments in this paper, the termination condition was chosen as the total number of DEHB brackets to run. However, in our implementation it can also be specified as a total absolute number of function evaluations, or as a cumulative wallclock time budget. L6 is the call to Algorithm 2, which returns the number of configurations and the sequence of increasing budgets to be used for that SH bracket. The nomenclature DE[budgets[i]], used in L9 and L12, indicates the DE subpopulation associated with the budgets[i] fidelity level. The if...else block from L11-15 differentiates the first DEHB bracket from the rest. During the first DEHB bracket (bracket_counter == 0), from its second SH bracket onwards (i > 0), the top configurations from the lower fidelity are promoted6 for evaluation on the next higher fidelity. For all other scenarios, the function DE_trial_generation on L14, i.e., the sequence of mutation-crossover operations, generates a candidate configuration (config) to be evaluated. L17 carries out the DE selection procedure by comparing the fitness score of config against the target selected for that DE evolution step. The target (xi,g from Equation 4) is selected on L9 by a rolling pointer over the subpopulation list: for every iteration (every increment of j) the pointer moves forward by one index position in the subpopulation, selecting an individual as the target. When this pointer reaches the maximal index, it resets to the starting index of the subpopulation. L18 compares the score of the last evaluated config with the best score found so far. If the new config has a better fitness score, the best found score is updated and the new config is marked as the incumbent, config_inc. This stores the best found configuration as an anytime best-performing configuration.

6: only evaluated on the higher budget, not evolved using mutation-crossover-selection
Algorithm 3: DEHB
  Input:
    bmin, bmax - min and max budgets
    η - (default η = 3)
    F - scaling factor (default F = 0.5)
    p - crossover rate (default p = 0.5)
  Output: Best found configuration, config_inc
   1  smax = ⌊log_η(bmax / bmin)⌋
   2  Initialize (smax + 1) DE subpopulations randomly
   3  bracket_counter = 0
   4  while termination condition do
   5      for iteration ∈ {0, 1, ..., smax} do
   6          budgets, n_configs = SH_bracket_under_HB(bmin, bmax, η, iteration)
   7          for i ∈ {0, 1, ..., smax − iteration} do
   8              for j ∈ {1, 2, ..., n_configs[i]} do
   9                  target = rolling pointer for DE[budgets[i]]
  10                  mutation_type = "vanilla" if i is 0 else "altered"
  11                  if bracket_counter is 0 and i > 0 then
  12                      config = j-th best config from DE[budgets[i − 1]]
  13                  else
  14                      config = DE_trial_generation(target, mutation_type)
  15                  end
  16                  result = Evaluate config on budgets[i]
  17                  DE selection using result, config vs. target
  18                  Update incumbent, config_inc
  19              end
  20          end
  21      end
  22      bracket_counter += 1
  23  end
  24  return config_inc
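The rolling-pointer target selection (L9) and the incumbent update (L18) described above can be stated in a few lines. The following is an illustrative Python sketch written for this article; the class SubPopulation and the helper update_incumbent are our own names and not part of the DEHB package.

import numpy as np

class SubPopulation:
    """One DE subpopulation tied to a single fidelity level (illustrative)."""
    def __init__(self, configs):
        self.configs = list(configs)   # configuration vectors in [0, 1]^D
        self.pointer = 0               # rolling pointer used for target selection

    def next_target(self):
        """Return the current target and advance the pointer cyclically."""
        idx = self.pointer
        self.pointer = (self.pointer + 1) % len(self.configs)
        return idx, self.configs[idx]

def update_incumbent(incumbent, config, score):
    """Keep whichever configuration has the lower score (DE is a minimizer)."""
    best_config, best_score = incumbent
    return (config, score) if score < best_score else (best_config, best_score)

# usage sketch
rng = np.random.default_rng(0)
subpop = SubPopulation(rng.uniform(size=(5, 3)))
incumbent = (None, float("inf"))
for step in range(7):                      # the pointer wraps after 5 steps: 0 1 2 3 4 0 1
    idx, target = subpop.next_target()
    score = float(np.sum(target))          # stand-in for an evaluated fitness
    incumbent = update_incumbent(incumbent, target, score)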
C.2 Handling Mixed Data Types

When dealing with discrete or categorical search spaces, such as the NAS problem, the best way to apply DE with such parameters is to keep the population continuous and perform mutation and crossover normally (Eq. 3, 4); then, to evaluate a configuration, we evaluate a copy of it in the original discrete space as we explain below. If we instead dealt with a discrete population, the diversity of the population would drop dramatically, leading to many individuals having the same parameter values; the resulting population would then have many duplicates, lowering the diversity of the difference distribution and making it hard for DE to explore effectively. We designed DEHB to scale all parameters of a configuration in a population to a unit hypercube [0, 1], for the two broad types of parameters normally encountered (see the sketch after this list):

• Integer and float parameters, Xi ∈ [ai, bi], are retrieved as ai + (bi − ai) · Ui,g, where the integer parameters are additionally rounded.

• Ordinal and categorical parameters, Xi ∈ {x1, ..., xn}, are treated equivalently, s.t. the range [0, 1] is divided uniformly into n bins.
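The following is a minimal decoding sketch for the scaling just described, written for this article (decode and the example space are our own names, not DEHB's API): floats are rescaled linearly from [0, 1], integers are additionally rounded, and ordinal/categorical values are read off from uniform bins.

# hypothetical search space: (type, domain) per dimension
space = [
    ("float", (0.0, 0.9)),               # e.g., a continuous rate
    ("int",   (16, 512)),                # e.g., a layer width
    ("cat",   ["relu", "tanh", "gelu"])  # e.g., an activation choice
]

def decode(u, space):
    """Map a vector u in [0, 1]^D to a configuration in the original space."""
    config = []
    for u_i, (kind, dom) in zip(u, space):
        if kind == "float":
            a, b = dom
            config.append(a + (b - a) * u_i)
        elif kind == "int":
            a, b = dom
            config.append(int(round(a + (b - a) * u_i)))
        else:  # ordinal / categorical: split [0, 1] into len(dom) uniform bins
            idx = min(int(u_i * len(dom)), len(dom) - 1)   # guard against u_i == 1.0
            config.append(dom[idx])
    return config

print(decode([0.5, 0.25, 0.9], space))   # -> [0.45, 140, 'gelu']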
We also experimented with another encoding design, in which each category of each categorical variable is represented as a continuous variable in [0, 1] and the category with the maximal continuous value is chosen [Vallati et al., 2015]. For example, in Figure 16, the effective dimensionality of the search space then becomes 96: 32 continuous variables plus 64 continuous variables derived from the 32 binary variables. We chose a representative set of benchmarks (NAS-Bench-201 and Counting Ones) to compare DEHB under the two encodings, shown in Figures 16 and 17. One example suffices to show that this encoding performs much worse than the DE-NAS [Awad et al., 2020] encoding we chose for DEHB; the encoding from [Vallati et al., 2015] did not achieve a better final performance than DEHB in any of our experiments.

[Figure 16: Comparing DEHB encodings for the Stochastic Counting Ones problem in 64-dimensional space with 32 categorical and 32 continuous hyperparameters. Results for all algorithms on 50 runs. Axes: normalized validation regret vs. cumulative budget / bmax.]

[Figure 17: Comparing DEHB encodings for the Cifar-100 dataset from NAS-Bench-201's 6-dimensional space. Results for all algorithms on 50 runs. Axes: validation regret vs. estimated wallclock time [s].]
C.3 Parallel Implementation

The DEHB algorithm is a sequence of DEHB brackets, which in turn are a fixed sequence of SH brackets. This feature, along with the asynchronous nature of DE, allows a parallel execution of DEHB. We dub the main process the DEHB Orchestrator; it maintains a single copy of all DE subpopulations. An HB bracket manager determines which budget to run from which SH bracket. Based on this input from the bracket manager, the orchestrator can fetch a configuration7 from the current subpopulations and make an asynchronous call for its evaluation on the assigned budget. The rest of the orchestrator continues synchronously to check for free workers and to query the HB bracket manager for the next budget and SH bracket. Once a worker finishes computation, the orchestrator collects the result, performs DE selection and updates the relevant subpopulation accordingly. This form of update is referred to as immediate, asynchronous DE.

7: DE mutation and crossover to generate a configuration
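The orchestration pattern described above can be sketched with a thread pool: evaluation jobs are submitted asynchronously, and each result triggers DE selection against its target as soon as it arrives. This is a simplified illustration written for this article (orchestrate, evaluate and the toy objective are our own names, not the DEHB package's API), assuming one batch of jobs for a single subpopulation.

import concurrent.futures as cf
import random

def evaluate(config, budget):
    """Stand-in for an expensive training run; returns a score (lower is better)."""
    return sum((x - 0.3) ** 2 for x in config) + random.random() / budget

def orchestrate(jobs, subpop, n_workers=4):
    """Run (target_idx, config, budget) jobs in parallel and apply DE selection
    immediately as each result arrives (immediate, asynchronous DE)."""
    scores = {i: evaluate(c, budget=1) for i, c in enumerate(subpop)}  # scores of current targets
    with cf.ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {pool.submit(evaluate, config, budget): (idx, config)
                   for idx, config, budget in jobs}
        for fut in cf.as_completed(futures):
            idx, config = futures[fut]
            result = fut.result()
            if result <= scores[idx]:          # DE selection against the target's score
                subpop[idx], scores[idx] = config, result
    return subpop, scores

# usage sketch: 5 configurations in [0, 1]^3, one batch of asynchronous jobs
random.seed(0)
subpop = [[random.random() for _ in range(3)] for _ in range(5)]
jobs = [(i, [random.random() for _ in range(3)], 9) for i in range(5)]
subpop, scores = orchestrate(jobs, subpop)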
DEHB uses a synchronous SH routine. Though each of the function evaluations at a particular budget can be distributed, a higher budget needs to wait on all the lower-budget evaluations to finish: a higher-budget evaluation can begin only once the lower-budget evaluations are over and the top 1/η can be selected. However, the asynchronous nature of DE allows a new bracket to begin if a worker is available while existing SH brackets have pending jobs or are waiting for results. The new bracket can continue using the current state of the DE subpopulations maintained by the DEHB Orchestrator. Once the pending jobs from previous brackets are over, the DE selection updates the DEHB Orchestrator's subpopulations. Thus, the utilisation of available computational resources is maximized, while the central copy of subpopulations maintained by the Orchestrator ensures that each new SH bracket spawned works with the latest updated subpopulation.
D More details on Experiments

D.1 Baseline Algorithms

In all our experiments we keep the configuration of all the algorithms the same. These are well-performing settings that have been benchmarked in previous works: [Falkner et al., 2018], [Ying et al., 2019], [Awad et al., 2020].

Random Search (RS) We sample random architectures in the configuration space from a uniform distribution in each generation.

BOHB We used the implementation from https://github.com/automl/HpBandSter. In [Ying et al., 2019], the settings of the key hyperparameters were identified as: η is set to 3, the minimum bandwidth for the kernel density estimator is set to 0.3, and the bandwidth factor is set to 3. In our experiments, we deploy the same settings.

Hyperband (HB) We used the implementation from https://github.com/automl/HpBandSter. We set η = 3; this parameter is effectively fixed, since the NAS benchmarks include no other budgets.

Tree-structured Parzen Estimator (TPE) We used the open-source implementation from https://github.com/hyperopt/hyperopt. We kept the hyperparameter settings at their defaults.

Sequential Model-based Algorithm Configuration (SMAC) We used the implementation from https://github.com/automl/SMAC3 under its default parameter setting. Only for the Counting Ones problem with 64 dimensions, the initial design had to be changed to a Latin Hypercube design instead of a Sobol design.

Regularized Evolution (RE) We used the implementation from [Real et al., 2019]. We initially sample an edge or operator uniformly at random, then we perform the mutation. After reaching the population size, RE kills the oldest member at each iteration. As recommended by [Ying et al., 2019], the population size (PS) and sample size (TS) are set to 100 and 10, respectively.

Differential Evolution (DE) We used the implementation from [Awad et al., 2020], keeping the rand1 strategy for mutation and binomial crossover as the crossover strategy. We also use the same population size of 20 as [Awad et al., 2020].

All plots for all baselines show the incumbent validation regret over the estimated wallclock time, ignoring the optimization time. The x-axis therefore accounts only for the cumulative cost incurred by function evaluations for each algorithm. All algorithms were run for a similar actual wallclock time budget. Certain algorithms on certain benchmarks may not appear to have an equivalent total estimated wallclock time; that is an artefact of ignoring optimization time. Model-based algorithms such as SMAC, BOHB and TPE have a computational cost that depends on the observation history, so they may undertake fewer function evaluations in the same actual wallclock time.

[Figure 18: Results for the Stochastic Counting Ones problem for N = {4, 8, 16, 32} (panels 4+4, 8+8, 16+16, 32+32), indicating N categorical and N continuous hyperparameters in each case, comparing RS, HB, BOHB, TPE, SMAC, RE, DE and DEHB. Axes: normalized validation regret vs. cumulative budget / bmax. All algorithms shown were run for 50 runs.]

D.2 Artificial Toy Function: Stochastic Counting Ones

The Counting Ones benchmark was designed to minimize the following objective function (a code sketch follows at the end of this subsection):

    f(x) = -\left( \sum_{x_i \in X_{cat}} x_i + \sum_{x_j \in X_{cont}} \mathbb{E}_b\left[ \mathcal{B}_{p = x_j} \right] \right),

where the sum over the categorical variables (xi ∈ {0, 1}) represents the standard discrete counting ones problem. The continuous variables (xj ∈ [0, 1]) represent the stochastic component, with the budget b controlling the noise: the budget here is the number of samples used to estimate the mean of the Bernoulli distribution (B) with parameter xj.

The experiments on the Stochastic Counting Ones benchmark used N = {4, 8, 16, 32}, all of which are shown in Figure 18. For the low-dimensional cases, BOHB's and SMAC's models are able to give them an early advantage. For this toy benchmark the global optimum is located at a corner of the unit hypercube, and random samples can span the lower-dimensional space adequately for a model to improve the search rapidly. DEHB, on the other hand, may require a few extra function evaluations to reach similar convergence. However, this conservative approach aids DEHB in the high-dimensional cases, where it converges much more rapidly than the other algorithms, especially as SMAC's and BOHB's convergence worsens significantly. DEHB thus showcases its robust performance even when the dimensionality of the problem increases exponentially.
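For concreteness, the objective can be written down in a few lines. The sketch below is our own illustration (stochastic_counting_ones is our name): the categorical part is summed exactly, while each continuous parameter contributes a Monte-Carlo estimate of a Bernoulli mean whose sample size is the budget b, so larger budgets yield less noisy evaluations.

import numpy as np

def stochastic_counting_ones(x_cat, x_cont, budget, rng=None):
    """Negative (stochastic) counting ones objective; lower is better.

    x_cat  : array of categorical parameters in {0, 1}
    x_cont : array of continuous parameters in [0, 1]
    budget : number of Bernoulli samples per continuous parameter
    """
    rng = rng or np.random.default_rng()
    cat_term = np.sum(x_cat)
    # E_b[B_{p=x_j}] estimated from `budget` samples for each continuous parameter
    cont_term = sum(rng.binomial(n=budget, p=p_j) / budget for p_j in x_cont)
    return -(cat_term + cont_term)

rng = np.random.default_rng(0)
x_cat = np.ones(4, dtype=int)
x_cont = np.full(4, 0.9)
for b in (1, 9, 81):   # the noise on the continuous term shrinks as the budget grows
    print(b, stochastic_counting_ones(x_cat, x_cont, b, rng))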
D.3 Feed-forward networks on OpenML datasets

Figure 19 shows the results on all 6 datasets from the OpenML surrogates benchmark: Adult, Letter, Higgs, MNIST, Optdigits, Poker. The surrogate model space is just 6-dimensional, allowing BOHB and TPE to build more confident models and to be well-performing algorithms in this space, especially early in the optimization. However, DE and DEHB remain competitive and consistently achieve a better final performance than TPE and BOHB, respectively, while even TPE achieves a better final performance than BOHB. Overall, DEHB is a competitive anytime performer for this benchmark, with the most robust final performances.

[Figure 19: Results for the OpenML surrogate benchmark for the 6 datasets: Adult, Higgs, Letter, MNIST, Optdigits, Poker. The search space had 6 continuous hyperparameters. Axes: validation regret vs. estimated wallclock time [s]. All plots shown were averaged over 50 runs of each algorithm.]
                                                                                                                                                                                                mum of 3000 episodes. With a population of 20, DE will not
D.4                                Bayesian Neural Networks
                                                                                                                                                                                                inject a new individual into a population unless all 20 individ-
The search space for the two-layer fully-connected Bayesian                                                                                                                                     uals have participated as a parent in the crossover operation.
Neural Network is defined by 5 hyperparameters which are:                                                                                                                                       This accumulates wallclock time equivalent to 20 individu-
the step length, the length of the burn-in period, the number                                                                                                                                   als times 9 trials times time taken for a maximum of 3000
of units in each layer, and the decay parameter of the momen-                                                                                                                                   episodes. Which can explain the flat trajectories in the op-
tum variable. In Figure 20, we show the results for the tuning                                                                                                                                  timization trace for DE pop = 20 in Figure 21 (right). DE
of Bayesian Neural Networks on both the Boston Housing                                                                                                                                          pop = 10 slashes this accumulated wallclock time in half
and Protein Structure datasets for the 6-dimensional Bayesian                                                                                                                                   and is able to inject newer configurations into the population
Neural Networks benchmark. We observe that SMAC, TPE                                                                                                                                            faster and is able to search faster. Given enough runtime,
and BOHB are able to build models and reach similar re-                                                                                                                                         we expect DE pop = 20 to converge to similar final scores.
gions of performance with high confidence. DEHB is slower                                                                                                                                       DEHB uses the immediate update design for DE, wherein it
to match in such a low-dimensional noisy space. However,                                                                                                                                        updates the population immediately after a DE selection, and
given the same cumulative budget, DEHB achieves a com-                                                                                                                                          not wait for the entire population to evolve. We posit that this
petitive final score.                                                                                                                                                                           feature, along with lower fidelity search, and performing grid
                                                   Boston Housing                                                                                  Protein Structure
                                                                                                                                                                                                search over population sizes with Hyperband, enables DEHB
                       70                                                          RS
                                                                                   HB
                                                                                                                      70                                                           RS
                                                                                                                                                                                   HB
                                                                                                                                                                                                to be more practical than classical-DE.
 negative log-likelihood

                                                                                                negative log-likelihood

                       60                                                                                             60
                                                                                   BOHB                                                                                            BOHB
                       50                                                          TPE                                50                                                           TPE
                                                                                   SMAC                                                                                            SMAC
                       40                                                          RE                                 40                                                           RE                                                         cartpole
                                                                                                                                                                                                                        104                                                           104                 Cartpole
                       30                                                          DE                                 30                                                           DE
                                                                                   DEHB                                                                                            DEHB
                                                                                                                                                                                                 epochs until convergence

                                                                                                                                                                                                                                                               epochs until convergence

                       20                                                                                             20
                       10                                                                                             10
                           104                             105                            106                             104                             105                             106                           103                                                           103
                                                       MCMC steps                                                                                    MCMC steps
                                                                                                                                                                                                                                RS           SMAC
                                                                                                                                                                                                                                HB           RE                                             RS
                                                                                                                                                                                                                                BOHB         DE                                             DE pop = 10
Figure 20: Results for tuning 5 hyperparameters of a Bayesian                                                                                                                                                           1010
                                                                                                                                                                                                                          2 2
                                                                                                                                                                                                                                TPE          DEHB
                                                                                                                                                                                                                                                                                      102
                                                                                                                                                                                                                                                                                            DE pop = 20
                                                                                                                                                                                                                                       103               104                                              104        105
Neural Network on the the Boston Housing and Protein Structure                                                                                                                                                                                time [s]                                                    time [s]

datasets respectively, for 50 runs of each algorithm.
                                                                                                                                                                                                Figure 21: (left) Results for tuning PPO on OpenAI Gym cartpole
                                                                                                                                                                                                environment with 7 hyperparameters. Each algorithm shown was
D.5 Reinforcement Learning

For this benchmark, the proximal policy optimization (PPO) [Schulman et al., 2017] implementation is parameterized with 7 hyperparameters: # units layer 1, # units layer 2, batch size, learning rate, discount, likelihood ratio clipping and entropy regularization. Figure 21 summarises the performance of all algorithms on the RL problem for the Cartpole benchmark. SMAC uses a Sobol grid as its initial design, and both its benefit and drawback can be seen as SMAC rapidly improves, stalls, and then improves again once model-based search begins. However, BOHB and DEHB both remain competitive, and BOHB, DEHB and SMAC emerge as the top-3 for this benchmark, achieving similar final scores. We notice that the DE trace stands out as worse than RS; we explain the reason for this below. Given the late improvement for DE pop = 20, we posit that this is a result of the deferred updates of classical DE [Awad et al., 2020] combined with the design of the benchmark.

For classical DE, the updates are deferred: the results of the selection process are incorporated into the population, for consideration in the next evolution step, only after all the individuals of the population have undergone evolution. In terms of computation, the wallclock time for a full population's worth of function evaluations is accumulated before the population is updated (see the sketch below).
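The difference between deferred (classical, generation-wise) and immediate (DEHB-style) updates is easiest to see side by side. The sketch below is our own schematic comparison, not code from either implementation; evolve_one stands in for the mutation-crossover step and f for the expensive evaluation.

def deferred_generation(pop, scores, evolve_one, f):
    """Classical DE: the whole generation is evaluated before any replacement."""
    trials = [evolve_one(pop, i) for i in range(len(pop))]
    trial_scores = [f(t) for t in trials]            # pop_size evaluations accumulate here
    for i, (t, s) in enumerate(zip(trials, trial_scores)):
        if s <= scores[i]:
            pop[i], scores[i] = t, s                 # replacements are applied only now
    return pop, scores

def immediate_generation(pop, scores, evolve_one, f):
    """Immediate (DEHB-style) update: each selection is applied as soon as its
    evaluation returns, so later trials already see the improved individuals."""
    for i in range(len(pop)):
        t = evolve_one(pop, i)
        s = f(t)
        if s <= scores[i]:
            pop[i], scores[i] = t, s
    return pop, scores

With evaluations as expensive as 9 trials of up to 3000 episodes, the deferred variant only incorporates improvements after a full population's worth of wallclock time, which matches the flat segments of the DE pop = 20 trace.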
In Figure 21 we illustrate why, given how this benchmark is designed, this detail of DE slows down convergence. Along with a DE of population size 20, as used in the experiments, we compare a DE of population size 10 in Figure 21. For the Reinforcement Learning benchmark from [Falkner et al., 2018], each full-budget function evaluation consists of 9 trials of a maximum of 3000 episodes. With a population of 20, DE will not inject a new individual into the population until all 20 individuals have participated as a parent in the crossover operation. This accumulates wallclock time equivalent to 20 individuals times 9 trials times the time taken for a maximum of 3000 episodes, which explains the flat trajectories in the optimization trace for DE pop = 20 in Figure 21 (right). DE pop = 10 halves this accumulated wallclock time, injects newer configurations into the population faster, and therefore searches faster. Given enough runtime, we expect DE pop = 20 to converge to similar final scores. DEHB uses the immediate update design for DE, wherein it updates the population immediately after a DE selection and does not wait for the entire population to evolve. We posit that this feature, along with lower-fidelity search and the grid search over population sizes performed via Hyperband, enables DEHB to be more practical than classical DE.

[Figure 21: (left) Results for tuning PPO on the OpenAI Gym Cartpole environment with 7 hyperparameters; each algorithm shown was run for 50 runs. (right) The same experiment comparing DE with population sizes of 10 and 20. Axes: epochs until convergence vs. time [s].]