Surrogate Model Based Hyperparameter Tuning for Deep Learning with SPOT

A Preprint

Thomas Bartz-Beielstein∗
Institute for Data Science, Engineering, and Analytics
Technische Hochschule Köln
51643 Gummersbach, Germany
thomas.bartz-beielstein@th-koeln.de

arXiv:2105.14625v1 [cs.LG] 30 May 2021

June 1, 2021

Abstract

A surrogate model based hyperparameter tuning approach for deep learning is presented. This article demonstrates how the architecture-level parameters (hyperparameters) of deep learning models implemented in Keras/TensorFlow can be optimized. The implementation of the tuning procedure is 100% based on R, the software environment for statistical computing. With a few lines of code, existing R packages (tfruns and SPOT) can be combined to perform hyperparameter tuning. An elementary hyperparameter tuning task (neural network and the MNIST data) is used to exemplify this approach.

Keywords hyperparameter tuning · deep learning · hyperparameter optimization · surrogate model based optimization · sequential parameter optimization

1     Introduction

Deep-learning models require the specification of several architecture-level parameters, which are called hyperparameters. Hyperparameters are to be distinguished from the parameters of a model that are optimized during the training phase, e.g., via backpropagation. We will consider hyperparameters in this article. Typical questions regarding hyperparameters in deep-learning models are as follows:

    1. How many layers should be stacked?
    2. Which dropout rate should be used?
    3. How many filters (units) should be used in each layer?
    4. Which activation function should be used?

In many real-world projects, experienced deep-learning engineers build intuition over time as to what reasonable hyperparameters are, i.e., Hyperparameter Tuning (HPT) skills are developed. These skills are based on human expert and domain knowledge and not on valid formal rules. Chollet and Allaire [2018] describe the situation as follows:

    If you want to get to the very limit of what can be achieved on a given task, you can’t
    be content with arbitrary [hyperparameter] choices made by a fallible human. Your initial
    decisions are almost always suboptimal, even if you have good intuition. You can refine
    your choices by tweaking them by hand and retraining the model repeatedly—that’s what
    machine-learning engineers and researchers spend most of their time doing. But it shouldn’t
    be your job as a human to fiddle with hyperparameters all day—that is better left to a
    machine.

∗ https://www.spotseven.de

HPT develops tools to explore the space of possible hyperparameter configurations systematically, in a
principled way. The HPT process typically looks like this (a minimal sketch of the loop in R follows the list):

(HPT-1) Parameter selection. Choose a set of hyperparameters.
(HPT-2) Model building. Build the corresponding Deep Neural Network (DNN) model.
(HPT-3) Model training and evaluation. Fit the model to the training data and measure the final performance
        on the validation data.
(HPT-4) Parameter update. Choose the next set of hyperparameters to try.
(HPT-5) Looping. Repeat until budget is exhausted.
(HPT-6) Final evaluation on test data, i.e., measuring performance on the test (hold out) data.
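
To make this loop concrete, the following minimal sketch in R uses random search as the parameter-update rule (HPT-4) and a hypothetical analytic stand-in for model building, training, and validation; it is not the procedure used later in this article:

# Stand-in for HPT-2/HPT-3: build, train, and validate a model for one configuration.
# Here, a hypothetical analytic function imitates a validation loss.
trainAndValidate <- function(conf) {
  (conf$dropout - 0.1)^2 + (log10(conf$lr) + 3)^2 + rnorm(1, sd = 0.01)
}

budget  <- 20
results <- data.frame()
for (i in seq_len(budget)) {                      # HPT-5: repeat until the budget is exhausted
  conf <- list(dropout = runif(1, 0, 0.5),        # HPT-1/HPT-4: choose the (next) configuration,
               lr      = 10^runif(1, -6, -2))     #              here by plain random search
  loss <- trainAndValidate(conf)                  # HPT-2/HPT-3: build, train, and validate
  results <- rbind(results,
                   data.frame(dropout = conf$dropout, lr = conf$lr, loss = loss))
}
best <- results[which.min(results$loss), ]        # candidate for the final test-set evaluation (HPT-6)
print(best)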

Training the weights of a DNN model is relatively easy, e.g., a loss function is evaluated on a mini-batch of
data and then the backpropagation algorithm is used to move the weights in the right direction [Chollet and
Allaire 2018]. Updating hyperparameters, on the other hand, is extremely challenging, because computing the
feedback signal can be extremely expensive: it requires creating and training a new model on a dataset. And,
the hyperparameter space is not continuous or differentiable, because it is also made of discrete decisions.
Standard gradient methods are not applicable in hyperparameter space. Instead, gradient-free optimization
techniques, which sometimes are far less efficient than gradient methods, are applied.
These challenges make HPT difficult. The key to this process is the HPT algorithm, which uses the history of validation
performance, given various sets of hyperparameters, to choose the next set of hyperparameters to evaluate.
Several different approaches are possible:

     • manual search,
     • simple random search, i.e., choosing hyperparameters to evaluate at random, repeatedly,
     • grid and pattern search [Meignan et al. 2015] [Lewis et al. 2000] [Tatsis and Parsopoulos 2016],
     • model free algorithms, i.e., algorithms that do not explicitly make use of a model, e.g., evolutionary
       algorithms [Hansen 2006] [Bartz-Beielstein et al. 2014],
     • hyperband, i.e., a multi-armed bandit strategy that dynamically allocates resources to a set of random
       configurations and uses successive halving to stop poorly performing configurations [Li et al. 2016],
     • Surrogate Model Based Optimization (SMBO) such as Sequential Parameter Optimization Toolbox
       (SPOT), [Bartz-Beielstein et al. 2005] [Bartz-Beielstein et al. 2021]

Grid search and manual search are probably the most widely used strategies for HPT. Bergstra and Bengio
[2012] demonstrate empirically and show theoretically that randomly chosen trials are more efficient for HPT
than trials on a grid. Due to its simplicity, it turns out in many situations that random search is the best
solution. Hyperband can result in a worse final performance than model-based approaches, because it only
samples configurations randomly and does not learn from previously sampled configurations.
SMBO based on Kriging (aka Gaussian processes or Bayesian Optimization (BO)) has been successfully
applied to HPT in several works, e.g., Bartz-Beielstein and Markon [2004] propose a combination of classical
statistical tools, BO (Design and Analysis of Computer Experiments (DACE)), and Classification and
Regression Trees (CART) as a surrogate model. Snoek et al. [2012] used the CIFAR-10 dataset, which
consists of 60,000 32x32 color images in ten classes, for optimizing the hyperparameters of a Convolutional
Neural Networks (CNNs). Bergstra et al. [2013] optimized a three layer CNN and Mendoza et al. [2019]
participated in the 2016 AutoML challenge. O’Malley et al. [2019] present Keras Tuner, a hyperparameter
tuner for Keras with TensorFlow 2.0. After defining a model-building function, which takes an argument
from which hyperparameters such as the units (hidden nodes) of the neural network can be sampled,
a tuner is instantiated. Available tuners are RandomSearch and Hyperband.
HPT is a powerful technique that is an absolute requirement to get to state-of-the-art models on any task
or to win machine-learning competitions. However, there are important issues to keep in mind when doing
HPT: e.g., validation-set overfitting can occur, because hyperparameters are optimized based on information
derived from the validation data. Falkner et al. [2018] claim that practical Hyperparameter Optimization
(HPO) solutions should fulfill the following requirements: strong anytime and final performance, effective use
of parallel resources, scalability, as well as robustness and flexibility. A treatment for these issues is proposed
by Bartz-Beielstein et al. [2020]. Although their recommendations were developed for benchmark studies in
optimization, they are also relevant for HPT, because HPT can be seen as a special benchmarking variant.
These recommendations address the following aspects:

(R-1)   Goals: what are the reasons for performing benchmark studies?
(R-2)   Problems: how to select suitable problems?
(R-3)   Algorithms: how to select a portfolio of algorithms to be included in the benchmark study?
(R-4)   Performance: how to measure performance?
(R-5)   Analysis: how to evaluate results?
(R-6)   Design: how to set up a study, e.g., how many runs shall be performed?
(R-7)   Presentation: how to describe results?
(R-8)   Reproducibility: how to guarantee scientifically sound results and how to guarantee a lasting impact,
        e.g., in terms of comparability?

In addition to the recommendations, there are some specific issues that are caused by the deep-learning setup.
These will be discussed in Sec. 4.
Note that the terms HPT and HPO are sometimes used synonymously. We will use these terms with different
meanings:

HPO develops and applies methods to determine the best hyperparameters in an effective and efficient
     manner.
HPT develops and applies methods that try to understand the effects and interaction of hyperparameters to
     enable learning and understanding.

This article proposes a HPT approach based on SPOT that focuses on the following topics:

Understanding. In contrast to standard HPO approaches, SPOT provides statistical tools for understanding
      hyperparameter importance and interactions between several hyperparameters.
Transparency and Explainability. Understanding is a key tool for enabling transparency, e.g., quantifying
      the contribution of deep learning components (layers, activation functions, etc.).
Reproducibility. The software code used in this study is available as part of the package SPOT for the
      open source R software environment for statistical computing and graphics (R), distributed via the
      Comprehensive R Archive Network (CRAN).

This paper is structured as follows: Section 2 describes materials and methods that were used for the
experimental setup. Section 3 presents results from a simple experiment. A discussion is presented in Sec. 4.
The appendix contains information on how to set up the Python software environment for performing HPT
with SPOT and Training Run Tools for TensorFlow (tfruns). Source code for performing the experiments
will be included in the R package SPOT. Further information is published on https://www.spotseven.de
and, with some delay, on CRAN (https://cran.r-project.org/package=SPOT).

2     Materials and Methods
2.1     Benchmarks and Software Packages
Before presenting the elements (benchmarks and software tools) for the experiments, we consider similar
approaches that might be worth looking at. This list is not complete and will be updated in forthcoming
versions of this paper.

2.1.1    Software Packages
2.1.2    Hyperparameter Optimization Software
Balaprakash et al. [2018] present DeepHyper, a Python package that provides a common interface for the
implementation and study of scalable hyperparameter search methods. Karmanov et al. [2018] created a
Rosetta Stone of deep-learning frameworks to allow data-scientists to easily leverage their expertise from


Table 1: The hyperparameters and architecture choices for the fully connected networks as defined in Falkner
et al. [2018]

    Hyperparameter               Lower Bound    Upper Bound    Log-transform
    batch size                   2^3            2^8            yes
    dropout rate                 0              0.5            no
    initial learning rate        1e-6           1e-2           yes
    exponential decay factor     -0.185         0              no
    # hidden layers              1              5              no
    # units per layer            2^4            2^8            yes

one framework to another. They provide a common setup for comparisons across GPUs (potentially CUDA
versions and precision) and for comparisons across languages (Python, Julia, R). Users should be able to
verify the expected performance of their own installation.

2.1.3    Artificial Toy Functions

Because BO does not work well on high-dimensional mixed continuous and categorical configuration spaces,
Falkner et al. [2018] use a simple counting ones problem to analyze this problem. Zaefferer and Bartz-
Beielstein [2016] discuss these problems in greater detail. How to implement BO for discrete (and continuous)
optimization problems is analyzed in the seminal paper by Bartz-Beielstein and Zaefferer [2017].

2.1.4    Experiments on Surrogate Benchmarks

Falkner et al. [2018] optimized six hyperparameters that control the training procedure of a fully connected
neural network (initial learning rate, batch size, dropout, exponential decay factor for learning rate) and the
architecture (number of layers, units per layer) for six different datasets gathered from OpenML [Vanschoren
et al. 2014], see Table 1.
Instead of training the networks directly, Falkner et al. [2018] used a surrogate network. To build a surrogate,
they sampled 10,000 random configurations for each data set, trained them for 50 epochs, and recorded
classification error after each epoch, and total training time. Two independent random forests models were
fitted to predict these two quantities as a function of the hyperparameter configuration used. Falkner et al.
[2018] note that Hyperband (HB) initially performed much better than the vanilla BO methods and achieved
a roughly three-fold speedup over random search. Instead of using the surrogate network, we will use the
original networks. Our approach is described in Sec. 2.3.

2.2     Software: Keras, Tensorflow, tfruns, and SPOT

2.2.1    Keras and Tensorflow

Keras is the high-level Application Programming Interface (API) of TensorFlow (TF), which is developed with
a focus on enabling fast experimentation. TF is an open source software library for numerical computation
using data flow graphs [Abadi et al. 2016]. Nodes in the graph represent mathematical operations, while the
graph edges represent the multidimensional data arrays (tensors) communicated between them [O’Malley
et al. 2019]. The tensorflow R package provides access to the complete TF API from within R.

2.2.2    The R Package tfruns

The R package tfruns2 provides a suite of tools for tracking, visualizing, and managing TF training runs and
experiments from R. tfruns enables tracking the hyperparameters, metrics, output, and source code of every
training run and comparing hyperparameters and metrics across runs to find the best performing model. It
automatically generates reports to visualize individual training runs or comparisons between runs. tfruns can
be used without any changes to source code, because run data is automatically captured for all Keras and
TF models.
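
For illustration, a training script declares its tunable values via tfruns flags; the following is a minimal sketch, and the flag names and defaults shown here are illustrative rather than the exact flags used later:

# Inside a training script (e.g., a hypothetical mnistModel.R):
library(tfruns)

FLAGS <- flags(
  flag_numeric("dropout1", 0.4),   # first layer dropout rate
  flag_integer("units1", 256),     # units in the first hidden layer
  flag_numeric("lr", 0.001)        # learning rate
)

# FLAGS$dropout1, FLAGS$units1, and FLAGS$lr are then used when the Keras model is
# defined and compiled; tfruns records them together with the metrics of each run.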

  2
      https://cran.r-project.org/package=tfruns, https://tensorflow.rstudio.com/tools/tfruns


2.2.3      SPOT
The SPOT package for R is a toolbox for tuning and understanding simulation and optimization algorithms
[Bartz-Beielstein et al. 2021]. SMBO investigations are common approaches in simulation and optimization.
Sequential parameter optimization has been developed, because there is a strong need for sound statistical
analysis of simulation and optimization algorithms. SPOT includes methods for tuning based on classical
regression and analysis of variance techniques; tree-based models such as CART and random forest; BO
(Gaussian process models, aka Kriging), and combinations of different meta-modeling approaches.
SPOT implements key techniques such as exploratory fitness landscape analysis and sensitivity analysis.
SPOT can be used for understanding the performance of algorithms and for gaining insight into their
behavior. Furthermore, SPOT can be used as an optimizer and for automatic and interactive tuning. SPOT
finds improved solutions in the following way:

      1.   Initially, a population of (random) solutions is created.
      2.   A set of surrogate models is specified.
      3.   Then, the solutions are evaluated on the objective function.
      4.   Next, surrogate models are built.
      5.   A global search is performed to generate new candidate solutions.
      6.   The new solutions are evaluated on the objective function.

These steps are repeated until a satisfying solution has been found, see [Bartz-Beielstein et al. 2021].
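
As a minimal illustration of this loop, SPOT can be run on one of its built-in test functions; the settings below (a budget of 20 evaluations on the two-dimensional sphere function) are chosen for brevity and are unrelated to the DNN example that follows:

library(SPOT)

# Minimize the two-dimensional sphere function with a budget of 20 evaluations
resSphere <- spot(x = NULL,
                  fun = funSphere,
                  lower = c(-5, -5),
                  upper = c( 5,  5),
                  control = list(funEvals = 20))

resSphere$xbest   # best candidate solution found
resSphere$ybest   # corresponding objective function value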
How these software packages (Keras/TF, tfruns, and SPOT) can be combined in a very efficient and effective
manner will be exemplified in the following section.

2.3     Example: Tuning Hyperparameters with SPOT
The general DNN workflow will be as follows: first, the training data, train_images and train_labels,
are fed to the DNN. The DNN will then learn to associate images and labels. Finally, the trained DNN
should produce predictions for test_images. These predictions are evaluated, i.e., it is validated whether
they match the labels from test_labels.

2.3.1      The Data Set: MNIST
The DNN in this example uses the Keras R package to learn to classify hand-written digits from the
Modified National Institute of Standards and Technology (MNIST) data set. This is a supervised multi-class
classification problem, i.e., grayscale images of handwritten digits (28 pixels by 28 pixels) should be assigned
to ten categories (0 to 9). MNIST is a set of 60,000 training and 10,000 test images. The MNIST data set
comes preloaded in Keras, in the form of train and test lists, each of which includes a set of images (x) and
associated labels (y): train_images and train_labels form the training set, the data that the model will
learn from. The model will then be tested on the test set, test_images and test_labels. The images are
encoded as 3D arrays, and the labels are a 1D array of digits, ranging from 0 to 9. There is a one-to-one
correspondence between the images and the labels.
Before training the DNN, the data are preprocessed by reshaping them into the shape the DNN can process. All
values are scaled so that they lie in the [0, 1] interval. The natural (original) training images were stored in
an array of shape (60000, 28, 28) of type integer with values in the [0, 255] interval. They are transformed
into a double array of shape (60000, 28 × 28) with values between 0 and 1. Furthermore, the labels are
categorically encoded.
library(keras)
mnist <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(60000, 28 * 28)) / 255
x_test  <- array_reshape(mnist$test$x, c(10000, 28 * 28)) / 255
y_train <- to_categorical(mnist$train$y, 10)
y_test  <- to_categorical(mnist$test$y, 10)

2.3.2      The Default DNN Model

The default DNN is a fully connected network with two hidden layers, dropout after each hidden layer, and
a ten-class softmax output layer, layer_dense(units = 10, activation = 'softmax'). The default layer
sizes and dropout rates are those listed in Table 2.
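
A minimal sketch of this default model in the Keras R API, using the default values from Table 2 (the ReLU activations for the hidden layers are an assumption), is:

# Fully connected network: two hidden layers with dropout, ten-class softmax output.
# Units and dropout rates are the defaults from Table 2; the activations are assumed.
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')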

Finally, (i) the loss function, which determines how good a DNN prediction is based on the training data,
(ii) the optimizer, i.e., the update mechanism of the network, which adjusts the weights using backpropagation,
and (iii) the metrics, which monitor the progress during training and testing, are specified using the compile
function from Keras.
model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_rmsprop(lr = 0.01),
    metrics = c('accuracy')
)

The DNN training can be started as follows (using Keras's fit function).
history <- model %>% keras::fit(
    x_train, y_train,
    batch_size = 128,
    epochs = 20,
    verbose = 0,
    validation_split = 0.2
)

Two quantities are being displayed during training:

  (i) the loss of the network over the training data, and
 (ii) the accuracy of the network over the training data.

Figure 1 illustrates that an accuracy greater than 95 percent on the training data can be reached quickly.
Finally, using Keras's evaluate function, the DNN model performance can be checked on the test data set:
score <- model %>% evaluate(
    x_test, y_test,
    verbose = 0
)
cat('Test loss:', score[[1]], '\n')
cat('Test accuracy:', score[[2]], '\n')

Test loss: 0.2828377
Test accuracy: 0.9638


                            Figure 1: Loss and accuracy plotted against epochs

Running the DNN model as a standalone process before starting the tuning process is strongly recommended.
As shown in this section, the default DNN model seems to work fine. The tuning process can be started next.

2.3.3       Interfacing tfruns from SPOT
After testing the model as a standalone implementation, the model can be combined with the SPOT framework.
A wrapper function is used to connect tfruns to the SPOT tuner. The setup requires a few lines of R code
only.

The wrapper, funTfrunsSingle(), receives one hyperparameter vector from SPOT, passes its entries as
flags to a tfruns training run, and returns the resulting loss value as the objective to be minimized.
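
A minimal sketch of such a wrapper is shown below. It assumes a training script (here called mnistModel.R) that defines matching tfruns flags; the script name, the flag names, and the metric column used as return value are assumptions rather than the exact setup used in this study.

funTfrunsSingle <- function(x) {
  # Map one hyperparameter vector from SPOT to tfruns flags (names must match the
  # flags() definition in the training script; the names used here are assumptions).
  runs <- tfruns::tuning_run("mnistModel.R",
                             flags = list(dropout1 = x[1],
                                          dropout2 = x[2],
                                          units1   = round(x[3]),
                                          units2   = round(x[4]),
                                          lr       = x[5],
                                          epochs   = round(x[6])),
                             confirm = FALSE)
  # tuning_run() returns a data frame describing the run(s); the validation loss
  # is used as the objective value (column name assumed).
  runs$metric_val_loss
}

# SPOT expects an objective function that evaluates a matrix of candidates (one row each):
funTfruns <- function(X) matrix(apply(X, 1, funTfrunsSingle), ncol = 1)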

Table 2: The hyperparameters and architecture choices for the first DNN example: fully connected networks
     Variable Name     Hyperparameter               Type       Default   Lower Bound      Upper Bound
     x1                first layer dropout rate     numeric        0.4              0                1
     x2                second layer dropout rate    numeric        0.3              0                1
     x3                units per first layer        integer       256              16             512
     x4                units per second layer       integer       128               4             256
     x5                learning rate                numeric     0.001          0.0001              0.1
     x6                training epochs              integer         20              5               25

In this example, the following hyperparameters of the DNN are tuned:

  (i) the dropout rates, i.e., the fraction of units that are randomly dropped during training [Chollet and
      Allaire 2018]. The dropout rate of the first and second layer will be tuned individually.
 (ii) the number of units, i.e., the number of single outputs from a single layer. The number of units of the
      first and second layer will be tuned individually.
(iii) the learning rate, which controls how much to change the DNN model in response to the estimated
      error each time the model weights are updated.
(iv) the number of training epochs, where a training epoch is one forward and backward pass of a complete
      data set.

These hyperparameters and their ranges are listed in Table 2. Using these parameter specifications, we are
ready to perform the first SPOT HPT run:
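
The exact call is not reproduced here; a minimal sketch, using the bounds from Table 2, a budget of 600 evaluations, and a random forest surrogate (the control settings shown are assumptions), might look as follows:

library(SPOT)

lower <- c(0, 0,  16,   4, 1e-4,  5)    # lower bounds for x1, ..., x6 (Table 2)
upper <- c(1, 1, 512, 256, 1e-1, 25)    # upper bounds for x1, ..., x6 (Table 2)

res <- spot(x = NULL,
            fun   = funTfruns,          # matrix-valued wrapper around tfruns (defined above)
            lower = lower,
            upper = upper,
            control = list(funEvals = 600,              # total evaluation budget
                           model    = buildRandomForest))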
3     Results

Figure 2: Internal ‘list‘ structure of the result object ‘res‘ from the SPOT run. Information about the
surrogate model are omitted here for readability.
List of 9
 $ xbest   : num [1, 1:6] 1.31e-02 4.23e-01 4.73e+02 1.10e+02 4.89e-04 ...
 $ ybest   : num [1, 1] 0.0954
 $ x       : num [1:600, 1:6] 0.9906 0.7771 0.1383 0.0094 0.605 ...
 $ y       : num [1:600, 1] 2.3 2.09 1.46 1.13 1.95 ...
 $ logInfo : logi NA
 $ count   : int 600
 $ msg     : chr "budget exhausted"
 $ modelFit:List of 3
  ..$ rfFit:List of 17
 ...
  .. ..- attr(*, "class")= chr "randomForest"
  ..$ x    : num [1:599, 1:6] 0.9906 0.7771 0.1383 0.0094 0.605 ...
  ..$ y    : num [1:599, 1] 2.3 2.09 1.46 1.13 1.95 ...
  ..- attr(*, "class")= chr "spotRandomForest"
 $ ybestVec: num [1:600] 0.26 0.26 0.26 0.26 0.26 ...

Table 3: Hyperparameter values from iteration 454 and the best configuration. x1 denotes the first layer
dropout rate, x2 the second layer dropout rate, x3 the units per first layer, x4 the units per second layer, x5
the learning rate, and x6 the training epochs. y denotes the values of the loss function (to be minimized).
            hyperparameter configuration            x1       x2     x3     x4          x5    x6          y
            outlier454                           0.102    0.430    482    190    0.058251    11    34.0114
            xbest                                0.013    0.423    473    110    0.000489    14     0.0954

Because HPT might also require a significant amount of time and resources, even when it is performed
automatically, a few exploratory HPT procedures are performed first. Insights gained from HPT might help
to avoid pitfalls such as ill-specified parameter ranges, too short run times, etc.

In addition to an effective and efficient way to determine the optimal hyperparameter, SPOT provides tools
for learning and understanding.3
The HPT experiment from Sec. 2.3 used n = 600 evaluations, i.e., SPOT generated a result list (res) with
the information shown in Fig. 2.
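
For a quick look at the outcome, the best configuration and its loss value can be read directly from this list (the column order of xbest follows Table 2):

res$xbest    # best hyperparameter configuration found (x1, ..., x6)
res$ybest    # corresponding loss value
res$count    # number of objective function evaluations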
Plots. First of all, the res list information will be used to visualize the route to the solution: in Fig. 3, loss
function values are plotted against the number of iterations. This figure reveals that one hyperparameter
configuration (from iteration 454, it will be referred to as outlier454 in the following) should be investigated
further, because this configuration results in an unusually high loss value of 34.01. Without exception, the
loss values from the other hyperparameter configurations are smaller than 5. Using the default configuration
results in a loss value of 0.28. The related hyperparameters values are shown in Table 3.
Box plots. Secondly, looking at the relationship to the remaining 599 hyperparameter configurations
from this experiment might be insightful: therefore, Fig. 4 visualizes (using box plots) the ranges of the six
hyperparameters from the complete HPT experiment. Figure 4 reveals that the largest differences between
the best solution and the outlier454 occur for variables x1 , i.e., the first layer dropout rate, and x5 , i.e., the
learning rate.
Regression trees. Thirdly, to analyze effects and interactions between hyperparameters, a simple regression
tree, as shown in Fig. 5, can be used. The regression tree supports the observation that the hyperparameter
values for x1, i.e., the first layer dropout rate, and x5, i.e., the learning rate, are relevant. To conclude this first
analysis, the interaction between x1 and x5 will be visualized. SPOT provides several tools for the analysis
of interactions. Highly recommended is the use of contour plots, as shown in Fig. 6, which illustrates the
interaction between x1 and x5.
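
The regression tree in Fig. 5 can be reproduced, for example, with the rpart package applied to the design points and loss values stored in res; the following sketch is one way to do this and not necessarily the exact code behind the figure:

library(rpart)
library(rpart.plot)

# Collect SPOT's evaluated configurations and the corresponding loss values
df <- data.frame(res$x)
names(df) <- paste0("x", 1:6)
df$y <- res$y[, 1]

# Fit and plot a simple regression tree of the loss as a function of the hyperparameters
tree <- rpart(y ~ ., data = df)
rpart.plot(tree)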
   3
     Or, as Bartz-Beielstein [2006] wrote: “ [SPOT] provides means for understanding algorithms’ performance (we
will use datascopes similar to microscopes in biology and telescopes in astronomy).”


Figure 3: Loss function values plotted against the number of SPOT iterations. The orange line represents the
loss obtained with the default DNN hyperparameters. The blue line shows the best function value found
during the tuning procedure. Grey squares represent the values generated during the hyperparameter tuning
procedure. The red square shows one extreme value that should be investigated further. Note that the loss
values are plotted on a log scale.

In addition to the analysis performed so far, surface plots can be used: the surface plot shown in Fig. 6
supports the observation that the hyperparameters x5 and x1 have significant effects on the loss function
and that their values should be small.
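
SPOT provides its own plotting functions for such figures; as a stand-alone sketch, a similar interaction plot for x1 and x5 can be produced by fitting a surrogate to the stored design and predicting on a grid (the remaining hyperparameters are fixed at their medians; this is not the exact code behind Fig. 6):

library(randomForest)

df <- data.frame(res$x)
names(df) <- paste0("x", 1:6)
df$y <- res$y[, 1]
rf <- randomForest(y ~ ., data = df)

# Grid over x1 (first layer dropout rate) and x5 (learning rate)
x1.grid <- seq(0, 1, length.out = 50)
x5.grid <- seq(1e-4, 0.1, length.out = 50)
grid <- expand.grid(x1 = x1.grid, x5 = x5.grid)
grid$x2 <- median(df$x2); grid$x3 <- median(df$x3)
grid$x4 <- median(df$x4); grid$x6 <- median(df$x6)

z <- matrix(predict(rf, newdata = grid), nrow = length(x1.grid))
filled.contour(x1.grid, x5.grid, z,
               xlab = "x1 (first layer dropout rate)",
               ylab = "x5 (learning rate)")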
Linear models. Finally, a simple linear regression model can be fitted to the data. Based on the data from
SPOT’s res list, this can be done as follows:
lm.res <- lm(res$y ~ res$x)
summary(lm.res)

##   Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
##     (Intercept)   -0.2332275 0.2630064 -0.887     0.3756
##     res$x1         1.0295949 0.2103293    4.895 1.27e-06 ***
##     res$x2        -0.0059548 0.2108685 -0.028     0.9775
##     res$x3         0.0009967 0.0004168    2.392   0.0171 *
##     res$x4        -0.0010134 0.0007944 -1.276     0.2026
##     res$x5        19.7357685 1.8288093 10.792 < 2e-16 ***


Figure 4: Six box plots, i.e., each plot represents the values of one parameter (plotted on a log scale). The
red squares represent the outlier454 configuration, the blue squares show the settings of the best value. The
largest deviations can be observed for x1 and x5.

(Tree structure: the root node splits on x5 < 0.025; further splits use x1, x3, and x5; the leaf with the
lowest mean loss, 0.26, contains 35% of the configurations.)
Figure 5: Regression tree based on the first run with 600 evaluations. Apparently, the hyperparameter values
of x5 and x1 are important. This result supports the previous analysis.


                    Figure 6: Surface plot: learning rate x5 plotted against dropout1 x1 .

##   res$x6       0.0054800 0.0088735   0.618   0.5371
##   ---
##   Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
##   Residual standard error: 1.404 on 593 degrees of freedom
##   Multiple R-squared: 0.2283, Adjusted R-squared: 0.2205
##   F-statistic: 29.24 on 6 and 593 DF, p-value: < 2.2e-16
Although this linear model requires a detailed investigation (a misspecification analysis is necessary, e.g.,
R^2_adj ≈ 0.22), it is also in accordance with previous observations that the hyperparameters x5 and x1 have
significant effects on the loss function and that their values should be small.

4    Discussion and Conclusions
This study briefly explains how HPT can be used as a datascope for the optimization of DNN hyperparameters.
The results and observations presented in Sec. 3 can be stated as a hypothesis:
(H-1): The hyperparameters x1, i.e., the first layer dropout rate, and x5, i.e., the learning rate, have significant
       effects on the loss function. Their values should be small.
This hypothesis requires further investigations.
The results only scratch the surface of the HPT set of tools, e.g., the role and impact of noise was not
considered. SPOT provides very powerful tools such as Optimal Computing Budget Allocation (OCBA) to
handle noisy function evaluations efficiently [Chen et al. 1997] [Bartz-Beielstein et al. 2011].
Interesting structures and patterns in the data, e.g., patterns in Fig. 3 that occur in the interval between
iteration 150 and 200 (on the x-axis, which represents the number of iterations) should be investigated further.


Furthermore, there seems to be an upper limit for the values of the loss function:
no loss function values appear between 5 and 34 (34 is the loss value of the outlier454 parameter set).
Considering the research goals stated in Sec. 1, the HPT approach presented in this study provides many
tools and solutions. In addition to the research goals (R-1) to (R-8) from Sec. 1, important goals that are
specific for HPT in DNN should be mentioned. Problem and algorithm designs will be treated separately.

Problem Design. The problem design comprises the set of parameters that are related to the problem. In
      HPT and HPO, regression or classification tasks are often considered. In our study, the MNIST data
      set was chosen.
           • A sound test problem specification is necessary, i.e., train, validation, and test sets should be
             clearly specified.
           • Initialization (this is similar to the specification of starting points in optimization)
           • Usage of surrogate benchmarks should be considered (this is similar to the use of CFD simulations
             in optimization)
           • Repeats (power of the test, severity), i.e., how many runs are feasible or necessary?
           • What are meaningful differences (w.r.t. specification of the loss function)?
           • Remember: scientific relevance is not identical to statistical significance.
           • Floor and ceiling effects should be avoided.
           • Comparison to baseline (random search, random sampling, mean value...) is a must.
Algorithm Design. The algorithm design in HPT and HPO refers to the model, i.e., DNNs. In our study,
       the neural network from Sec. 2.3.2 was chosen.
            •   A sound algorithm (neural network) specification is required.
           •   Initialization, pre-training (starting points in optimization). Pre-tuning should be explained.
           •   Hyperparameter (ranges, types) should be clearly specified.
           •   Are there any additional (untunable) parameters?
            •   How is noise (randomness, stochasticity) treated?
           •   Is reproducibility ensured?
           •   Last but not least: open source code and open data should be provided.

To conclude: differences between HPT and HPO were discussed. A HPT approach based on SMBO was
introduced and exemplified. It combines two packages from the statistical programming environment R:
tfruns and SPOT, hence providing a HPT environment that is fully accessible from R.
Although HPT can be performed with R functions, the underlying TF (Keras) and Python environment has
to be installed. This installation is explained in the Appendix.

5     Appendix
5.1     Installing Python
5.1.1    Create a Python Environment in the Project
It is recommended that one Python virtual environment is used per experiment. Navigate into the project
directory by using the following command:
cd <project-directory>

Create a new virtual environment in a folder called python within the project directory using the following
command:
virtualenv python

5.1.2    Activate the Python Environment
The virtualenv can be activated using the following command in a terminal:


source python/bin/activate

To verify that the correct version of Python was activated the following command can be executed in a
terminal:
which python

5.1.3    Install Python packages in the Environment
Python packages such as numpy, pandas, matplotlib, and other packages can be installed in the Python
virtualenv by using pip install:
pip install numpy pandas matplotlib tensorflow

5.1.4    Install and Configure reticulate to use the Correct Python Version
Install the reticulate package using the following command in the R console:
install.packages("reticulate")

To configure reticulate to point to the Python executable in the virtualenv python from above, create a
file in the project directory called .Rprofile with the following contents:
Sys.setenv(RETICULATE_PYTHON = "python/bin/python")

R must be restarted for the setting to take effect. To check that reticulate is configured for the correct
version of Python the following command can be used in the R console:
reticulate::py_config()

# Get Data:
library("reticulate")
use_virtualenv("~/workspace/bab-bart11o/python")
library("keras")
reticulate::py_config()

5.2     Installing Keras

To get started with Keras, the Keras R package, the core Keras library, and a backend tensor engine (such as
TF) must be installed. This can be done as follows:
install.packages("tensorflow")
install.packages("keras")
library(keras)
install_keras()

This will provide you with a default CPU-based installation of Keras and TF. To install a GPU-based version
of the TF backend engine, the corresponding command reads as follows:
install_keras(tensorflow = "gpu")

5.3     Installing SPOT

The following commands can be used to install the most recent version of SPOT.
install.packages("devtools")
devtools::install_gitlab("bartz/spot",host="http://owos.gm.fh-koeln.de:8055")


References
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,
 M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
 M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,
 B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals,
 P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-Scale Machine Learning
 on Heterogeneous Distributed Systems. arXiv e-prints, art. arXiv:1603.04467, Mar. 2016.
P. Balaprakash, M. Salim, T. D. Uram, V. Vishwanath, and S. M. Wild. DeepHyper: Asynchronous hyperpa-
  rameter search for deep neural networks. In 25th IEEE International Conference on High Performance
  Computing (HiPC18). IEEE, 2018. doi: 10.1109/hipc.2018.00014.
T. Bartz-Beielstein. Experimental Research in Evolutionary Computation—The New Experimentalism.
  Natural Computing Series. Springer, Berlin, Heidelberg, New York, 2006. ISBN 3-540-32026-1. doi:
  10.1007/3-540-32027-X. URL http://dx.doi.org/10.1007/3-540-32027-X.
T. Bartz-Beielstein and S. Markon. Tuning Search Algorithms for Real-World Applications: A Regression Tree
  Based Approach. In G. W. Greenwood, editor, Proceedings 2004 Congress on Evolutionary Computation
  (CEC’04), Portland OR, pages 1111–1118, Piscataway NJ, 2004. IEEE. ISBN 0-7803-8515-2. doi: 10.
  1109/CEC.2004.1330986. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=
  1330986.
T. Bartz-Beielstein and M. Zaefferer. Model-based methods for continuous and discrete global optimization.
  Applied Soft Computing, 55:154 – 167, 2017. ISSN 1568-4946. doi: 10.1016/j.asoc.2017.01.039. URL
  http://www.sciencedirect.com/science/article/pii/S1568494617300546.
T. Bartz-Beielstein, C. Lasarczyk, and M. Preuss. Sequential Parameter Optimization. In B. McKay et al.,
  editors, Proceedings 2005 Congress on Evolutionary Computation (CEC’05), Edinburgh, Scotland, pages
  773–780, Piscataway NJ, 2005. IEEE Press. ISBN 0-7803-9363-5. doi: 10.1109/CEC.2005.1554761.
T. Bartz-Beielstein, M. Friese, M. Zaefferer, B. Naujoks, O. Flasch, W. Konen, and P. Koch. Noisy optimization
  with sequential parameter optimization and optimal computational budget allocation. In Proceedings of
  the 13th annual conference companion on Genetic and evolutionary computation, pages 119–120, New York,
  NY, USA, 2011. ACM. ISBN 978-1-4503-0690-4. doi: 10.1145/2001858.2001926.
T. Bartz-Beielstein, J. Branke, J. Mehnen, and O. Mersmann. Evolutionary algorithms. Wiley Interdisciplinary
  Reviews: Data Mining and Knowledge Discovery, 4(3):178–195, 2014. ISSN 1942-4795. doi: 10.1002/widm.
  1124. URL http://dx.doi.org/10.1002/widm.1124.
T. Bartz-Beielstein, C. Doerr, J. Bossek, S. Chandrasekaran, T. Eftimov, A. Fischbach, P. Kerschke,
  M. Lopez-Ibanez, K. M. Malan, J. H. Moore, B. Naujoks, P. Orzechowski, V. Volz, M. Wagner, and
  T. Weise. Benchmarking in optimization: Best practice and open issues. arXiv, 07 2020. URL https:
  //arxiv.org/abs/2007.03488.
T. Bartz-Beielstein, M. Zaefferer, and F. Rehbach. In a Nutshell – The Sequential Parameter Optimization
  Toolbox. arXiv e-prints, art. arXiv:1712.04076, Dec. 2021.
J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:
  281–305, 2012. ISSN 1532-4435; 1533-7928/e.
J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in
   hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on
  International Conference on Machine Learning - Volume 28, ICML’13, pages I–115–I–123. JMLR.org, 2013.
H. C. Chen, C. H. Chen, L. Dai, and E. Yücesan. New Development of Optimal Computing Budget
  Allocation for Discrete Event Simulation. In S. Andradóttir, K. J. Healy, D. H. Withers, and B. L. Nelson,
  editors, Proceedings of the 1997 Winter Simulation Conference, pages 334–341, Piscataway NJ, 1997. IEEE
  Computer Society.
F. Chollet and J. Allaire. Deep Learning with R. Manning, 2018.
S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and Efficient Hyperparameter Optimization at Scale,
  2018. URL http://arxiv.org/abs/1807.01774.
N. Hansen. The CMA evolution strategy: a comparing review. In J. A. Lozano, P. Larranaga, I. Inza, and
  E. Bengoetxea, editors, Towards a new evolutionary computation. Advances on estimation of distribution
  algorithms, pages 75–102. Springer, 2006.


I. Karmanov, M. Salvaris, M. Fierro, and D. Dean. Comparing deep learning frameworks: A rosetta stone
   approach. Microsoft docs, Microsoft, 2018. Last accessed 29-May-2021. https://docs.microsoft.com/de-
   de/archive/blogs/machinelearning/comparing-deep-learning-frameworks-a-rosetta-stone-approach.
R. M. Lewis, V. Torczon, and M. W. Trosset. Direct search methods: Then and now. Journal of Computational
   and Applied Mathematics, 124(1–2):191–207, 2000.
L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A Novel Bandit-Based
   Approach to Hyperparameter Optimization. arXiv e-prints, art. arXiv:1603.06560, Mar. 2016.
D. Meignan, S. Knust, J.-M. Frayet, G. Pesant, and N. Gaud. A Review and Taxonomy of Interactive
   Optimization Methods in Operations Research. ACM Transactions on Interactive Intelligent Systems, Sept.
   2015.
H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, M. Urban, M. Burkart, M. Dippel, M. Lindauer, and
   F. Hutter. Towards Automatically-Tuned Deep Neural Networks, pages 135–149. Springer International
   Publishing, Cham, 2019. ISBN 978-3-030-05318-5. doi: 10.1007/978-3-030-05318-5_7. URL https:
   //doi.org/10.1007/978-3-030-05318-5_7.
T. O’Malley, E. Bursztein, J. Long, F. Chollet, H. Jin, L. Invernizzi, et al. Keras Tuner, 2019.
   https://github.com/keras-team/keras-tuner.
J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms.
   arXiv e-prints, art. arXiv:1206.2944, June 2012.
V. Tatsis and K. Parsopoulos. Grid search for operator and parameter control in differential evolution. In 9th
   Hellenic Conference on Artificial Intelligence (SETN 2016), 2016.
J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: networked science in machine learning.
   arXiv e-prints, art. arXiv:1407.7722, July 2014.
M. Zaefferer and T. Bartz-Beielstein. Efficient global optimization with indefinite kernels. In J. Handl, E. Hart,
   P. R. Lewis, M. López-Ibáñez, G. Ochoa, and B. Paechter, editors, Parallel Problem Solving from Nature –
   PPSN XIV: 14th International Conference, Edinburgh, UK, September 17-21, 2016, Proceedings, pages 69–79.
   Springer International Publishing, Cham, 2016. ISBN 978-3-319-45823-6. doi: 10.1007/978-3-319-45823-6_7.
   URL http://dx.doi.org/10.1007/978-3-319-45823-6_7.
