Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach

Page created by Joseph Gibson
 
CONTINUE READING
Computational Performance Predictions for Deep Neural Network Training:
                                                                   A Runtime-Based Approach

                                               Geoffrey X. Yu                    Yubo Gao                          Pavel Golikov                 Gennady Pekhimenko
                                             University of Toronto          University of Toronto               University of Toronto            University of Toronto
                                                                                                                                                   Vector Institute
arXiv:2102.00527v1 [cs.LG] 31 Jan 2021

                                                                  Abstract                                      2080Ti [58]) and server-class GPUs (e.g., A100 [66]) all
                                                                                                                the way to specialized accelerators such as the TPU [34],
                                         Deep learning researchers and practitioners usually leverage
                                                                                                                Gaudi [26], IPU [25], and the Cerebras WSE [12]. Having all
                                         GPUs to help train their deep neural networks (DNNs) faster.
                                                                                                                these options offers flexibility to users, but at the same time
                                         However, choosing which GPU to use is challenging both
                                                                                                                can also lead to a paradox of choice: which hardware option
                                         because (i) there are many options, and (ii) users grapple with
                                                                                                                should a researcher or practitioner use to train their DNNs?
                                         competing concerns: maximizing compute performance while
                                                                                                                   A natural way to start answering this question is to first
                                         minimizing costs. In this work, we present a new practical
                                                                                                                consider CUDA-enabled GPUs. This is because they (i) are
                                         technique to help users make informed and cost-efficient GPU
                                                                                                                commonly used in deep learning; (ii) are supported by all
                                         selections: make performance predictions using the help of
                                                                                                                major deep learning software frameworks (PyTorch [73], Ten-
                                         a GPU that the user already has. Our technique exploits the
                                                                                                                sorFlow [1], and MXNet [13]); (iii) have mature tooling sup-
                                         observation that, because DNN training consists of repetitive
                                                                                                                port (e.g., CUPTI [64]); and (iv) are readily available for rent
                                         compute steps, predicting the execution time of a single itera-
                                                                                                                and purchase. In particular, when considering GPUs, we find
                                         tion is usually enough to characterize the performance of an
                                                                                                                that that there are many situations where a deep learning user
                                         entire training process. We make predictions by scaling the
                                                                                                                needs to choose a specific GPU to use for training:
                                         execution time of each operation in a training iteration from
                                         one GPU to another using either (i) wave scaling, a technique          • Choosing between different hardware tiers. In both
                                         based on a GPU’s execution model, or (ii) pre-trained multi-             academia and industry, deep learning users often have ac-
                                         layer perceptrons. We implement our technique into a Python              cess to several tiers of hardware: (i) a workstation with
                                         library called Surfer and find that it makes accurate iteration          a GPU used for development (e.g., 2080Ti), (ii) a private
                                         execution time predictions on ResNet-50, Inception v3, the               GPU cluster that is shared within their organization (e.g.,
                                         Transformer, GNMT, and DCGAN across six different GPU                    RTX6000 [72]), and (iii) GPUs that they can rent in the
                                         architectures. Surfer currently supports PyTorch, is easy to             cloud (e.g., V100 [53]). Each tier offers a different cost,
                                         use, and requires only a few lines of code.                              availability, and performance trade-off. For example, a
                                                                                                                  private cluster might be “free” (in monetary cost) to use,
                                                                                                                  but jobs may be queued because the cluster is also shared
                                         1    Introduction                                                        among other users. In contrast, cloud GPUs can be rented
                                                                                                                  on-demand for exclusive use.
                                         Over the past decade, deep neural networks (DNNs) have seen
                                         incredible success across many machine learning tasks [18,             • Deciding on which GPU to rent or purchase. Cloud
                                         27, 29, 38, 79, 82, 85]—leading them to become widely used               providers make many different GPUs available for rent (e.g.,
                                         throughout academia and industry. However, despite their pop-            P100 [49], V100, T4 [59], and A100 [66]), each with differ-
                                         ularity, DNNs are not always straightforward to use in practice          ent performance at different prices. Similarly, a wide variety
                                         because they can be extremely computationally-expensive to               of GPUs are available for purchase (e.g., 1080Ti [51], TI-
                                         train [17,40,81,91]. This is why, over the past few years, there         TAN V [55], 2080Ti, 3090 [70]) both individually and as a
                                         has been a significant and ongoing effort to bring hardware              part of a pre-built workstation [39]. These GPUs can vary
                                         acceleration to DNN training [12, 25, 26, 34, 62, 66, 68].               up to 6× in price [84] and 6× in peak performance [67].
                                            As a result of this effort, today there is a vast array of
                                         hardware options for deep learning users to choose from                • Determining how to schedule a job in a heterogeneous
                                         for training. These options range from desktop GPUs (e.g.,               GPU cluster. A compute cluster (e.g., operated by a cloud

                                                                                                            1
import surfer
  provider [8,24,45]) may have multiple kinds of GPUs avail-
  able that can handle a training workload. Deciding which                        tracker = surfer.OperationTracker(
  GPU to use for a job will typically depend on the job’s pri-                      origin_device=surfer.Device.RTX2070,
  ority and performance on the GPU being considered [48].                         )

• Selecting alternative hardware configurations. When a                           with tracker.track():
                                                                                      run_my_training_iteration()
  desired GPU is unavailable (e.g., due to capacity constraints
  in the cloud), a user may need to select a different GPU                        trace = tracker.get_tracked_trace()
  with a comparable cost-normalized performance. For exam-                        print("Pred. iter. exec. time: {:.2f} ms".format(
  ple, when training ResNet-50 [27] on Google Cloud [23],                           trace.to_device(surfer.Device.V100).run_time_ms,
                                                                                  ))
  we find that both the P100 and V100 have similar cost-
  normalized throughputs (differing by just 0.8%). If the                         Listing 1: An example of how Surfer can be used to make
  V100 were to be unavailable,1 a user may want to use the                        iteration execution time predictions.
  P100 instead since the total training cost would be similar.
What makes these situations interesting is that there is not                      the execution time of a training iteration on an existing GPU,
necessarily a single “correct” choice. Users make GPU se-                         and then (ii) we scale the measured execution times of each
lections based on whether the performance benefits of the                         individual operation onto a different GPU using either wave
chosen configuration are worth the cost to train their DNNs.                      scaling or pre-trained multilayer perceptrons (MLPs) [21].
But making these selections in an informed way is not easy,                       Wave scaling is a technique that applies scaling factors to
as performance depends on many factors simultaneously:                            the GPU kernels in an operation, based on a mix of the ratios
(i) the DNN being considered, (ii) the GPU being used, and                        between the two GPUs’ memory bandwidth and compute
(iii) the underlying software libraries used during training                      units. We use MLPs for certain operations (e.g., convolution)
(e.g., cuDNN [62], cuBLAS [65]).                                                  where the kernels used differ between the two GPUs; we
   To do this performance analysis today, the common wis-                         describe this phenomenon and the MLPs in more detail in
dom is to either (i) directly measure the computational per-                      Sections 3.2 and 3.4. We believe that using an existing GPU
formance (e.g., throughput) by actually running the training                      to make operation execution time predictions for a different
job on the GPU, or (ii) consult existing benchmarks (e.g.,                        GPU is reasonable because deep learning users often already
MLPerf [40]) published by the community to get a “ballpark                        have a local GPU that they use for development.
estimate.” While convenient, these approaches also have their                        We implement our technique into a Python library that we
own limitations. Making measurements requires users to al-                        call Surfer, and evaluate its prediction accuracy on five DNNs
ready have access to the GPUs they are considering; this may                      that have applications in image classification, machine trans-
not be the case if a user is deciding whether or not to buy or                    lation, and image generation: (i) ResNet-50, (ii) Inception
rent that GPU in the first place. Secondly, benchmarks are                        v3 [83] (iii) the Transformer, (iv) GNMT [88], and (v) DC-
usually only available for a subset of GPUs (e.g., the V100                       GAN [76]. We use Surfer to make iteration execution time
and T4) and only for common “benchmark” models (e.g.,                             predictions across six different GPUs and find that it makes
ResNet-50 [27] and the Transformer [85]). They are not as                         accurate predictions with an average error of 11.8%. Addi-
helpful if you need an accurate estimate of the performance of                    tionally, we present two case studies to show how Surfer can
a custom DNN on a specific GPU (a common scenario when                            be used to help users make accurate cost-efficient GPU selec-
doing deep learning research).                                                    tions according to their needs (Section 5.3).
   In this work, we make the case for a third complementary                          We designed Surfer to be easy and practical to use. With a
approach: making performance predictions. Although pre-                           few lines of Python, users can leverage Surfer to predict the
dicting the performance of general compute workloads can                          potential computational training performance of their DNNs
be prohibitively difficult due to the large number of possible                    on a given GPU (Listing 1). Surfer currently supports Py-
program phases, we observe that DNN training workloads are                        Torch [73] and can be extended to other frameworks as well.
special because they contain repetitive computation. DNN                             In summary, this work makes the following contributions:
training consists of repetitions of the same training iteration,
which means that the performance of an entire training pro-                       • Wave scaling: a new technique that scales the execution
cess can be characterized by just a few training iterations.                        time of a kernel measured on one GPU to a different GPU
   We leverage this observation to build a new technique that                       by using scaled ratios between the (i) number of compute
predicts a DNN’s training iteration execution time on a given                       units on each GPU, and (ii) their memory bandwidths.
GPU using both runtime information and hardware charac-                           • The implementation and evaluation of Surfer: a new li-
teristics. We make predictions in two steps: (i) we measure                         brary that uses wave scaling along with pre-trained MLPs
   1 Inour experience, we often ran into situations where the V100 was              to predict the execution time of DNN training iterations on
unavailable for rent because the cloud provider had an insufficient supply.         different GPUs.

                                                                              2
2     Why Predict Performance?

                                                                          Iter. Execution Time (ms)
                                                                                                      200       Actual                          42.5%
                                                                                                                Predicted
This paper presents a new practical technique for predicting                                          150                               63.1%
                                                                                                                                46.7%
the execution time of a DNN training iteration on different                                           100   55.9%       64.9%
GPUs, with the goal of helping deep learning users make in-
formed cost-efficient GPU selections. However, a common                                               50
first question is to ask why we need to make these perfor-                                              0
mance predictions in the first place. Could other performance                                               V100       2080Ti   P100    2070    P4000
                                                                                                                                GPU
comparison approaches (e.g., simple heuristics or measure-              Figure 1: DCGAN iteration execution time predictions, and
ments) be used instead? In this section, after providing some           their errors, made from the T4 using peak FLOPS ratios be-
background about DNN training, we outline the problems                  tween the devices. Using simple heuristics can lead to high
with these alternative approaches to further motivate the need          prediction errors.
for practical performance predictions.

                                                                        2.2                             Why Not Measure Performance Directly?
2.1    Background on DNN Training                                       Perhaps the most straightforward approach to compare the
                                                                        performance of different GPUs is to just measure the iteration
DNNs, at their heart, are mathematical functions that produce
                                                                        execution time (and hence, throughput) on each GPU when
predictions given an input and a set of learned parameters,
                                                                        training a given DNN. However, this approach also has a
also known as weights [21]. They are built by combining
                                                                        straightforward downside: it requires the user to actually have
together a series of different layers, each of which may contain
                                                                        access to the GPU(s) being considered in the first place. If a
weights. The layers map to mathematical operations. For
                                                                        user is looking to buy or rent a cost-efficient GPU, they would
example, a fully connected layer is implemented using matrix
                                                                        ideally want to know its performance on their DNNs before
multiplication [21]. To produce predictions, a DNN takes
                                                                        spending money to get access to the GPU.
a tensor (an n-dimensional array) as input and applies the
operations associated with each layer in sequence.
                                                                        2.3                             Why Not Apply Heuristics?
Training. A DNN learns its weights in an iterative pro-
cess called training. Each training iteration operates on a             Another approach is to use heuristics based on the hardware
batch of labelled inputs and consists of a forward pass, back-          specifications published by the manufacturer. For example,
ward pass (using backpropagation [77]), and weight update.              one could use the ratio between the peak floating point opera-
The forward and backward passes compute gradients for the               tions per second (FLOPS) of two GPUs or the ratio between
weights, which are then used by an optimization algorithm               the number of CUDA cores on each GPU. The problem with
(e.g., stochastic gradient descent [10] or Adam [37]) to up-            this approach is that these heuristics do not always work.
date the weights so that the DNN produces better predictions.           They assume that a DNN training workload can exhaust all
These steps are repeated until the DNN makes acceptably                 the computational resources on a GPU, which is not true in
accurate predictions.                                                   general [91].
                                                                           To show an example of when simple heuristics do not work
Computational performance. Although conceptually sim-
                                                                        well, we use a GPU’s peak FLOPS to make iteration exe-
ple, prior work has shown that DNN training can be an ex-
                                                                        cution time predictions. We measure the execution time of
tremely time-consuming process [17, 40, 81, 91]. There are
                                                                        a DCGAN training iteration on the T42 and then use this
two primary factors that influence the time it takes a DNN to
                                                                        measurement to predict the iteration execution time on differ-
reach an acceptable accuracy during training [46]: (i) statis-
                                                                        ent GPUs by multiplying by the ratio between the devices’
tical efficiency, and (ii) hardware efficiency. Statistical effi-
                                                                        peak FLOPS. Figure 1 shows the measured and predicted
ciency governs the number of training iterations (i.e., weight
                                                                        execution times on each GPU, along with the prediction error
updates) required to reach a target test accuracy whereas hard-
                                                                        as a percentage. The main takeaway from this figure is that
ware efficiency governs how quickly a training iteration runs.
                                                                        using simple heuristics can lead to high prediction errors; the
In this work, we focus on helping deep learning users make
                                                                        highest prediction error in this experiment is 64.9%, and all
informed cost-efficient hardware configuration selections to
                                                                        the prediction errors are at least 42.5%. In contrast, Surfer
improve their DNN’s hardware efficiency. As a result, we
                                                                        can make these exact same predictions with an average error
compare the performance of different GPUs when training a
                                                                        of 10.2% (maximum 21.8%).
DNN using the time it takes a training iteration to run. This
metric equivalently captures the training throughput for that                          2 We
                                                                                 use a batch size of 128 LSUN [90] synthetic inputs. See Section 5.1
particular DNN.                                                         for details about our methodology.

                                                                    3
2.4    Why Not Use Benchmarks?                                         convolutional, pooling, fully connected, and batch normal-
                                                                       ization [31] layers. This observation reduces the problem of
A third potential approach is to consult published benchmark-          predicting the performance of an arbitrary DNN’s training
ing results [17, 40, 69, 91]. However, the problem with relying        iteration to developing prediction mechanisms for a small set
on benchmarking results is that they are limited to a set of           of operations.
“common” DNNs (e.g., ResNet-50 or the Transformer) and are
usually only available for a small selection of GPUs (e.g., the        Observation 3: Runtime information available. When
T4, V100, and A100). Moreover, benchmarking results also               working on DNNs, users often have a GPU available for use
vary widely among different models and GPUs [40, 69, 91].              in their workstations. These GPUs are used for development
Therefore if no results exist for the GPU(s) a user is consid-         purposes and are not necessarily chosen for the highest per-
ering, or if a user is working with a new DNN architecture,            formance (e.g., 1080Ti [51], TITAN Xp [56]). However, they
there will be no benchmark results for them to consult.                can be used to provide valuable runtime information about
                                                                       the GPU kernels that are used to implement a given DNN. In
                                                                       Section 3.3, we describe how we can leverage this runtime
2.5    Why Not Always Use The “Best” GPU?                              information to predict the performance of the GPU kernels on
Finally, a fourth approach is to always use the most “powerful”        different GPUs (e.g., from a desktop-class GPU such as the
GPU available with the assumption that GPUs are already                2080Ti [58] to a server-class GPU such as the V100 [53, 54]).
priced based on their performance. Why make performance
predictions when the cost-efficiency of popular GPUs should            3.2    Surfer Overview
be the same? However, this assumption is a misconception;
prior work has already shown examples of situations where              Surfer records information at runtime about a DNN train-
it is not true [48, 91]. In this work, we also show additional         ing iteration on a given GPU (Observation 3) and then uses
examples in our case studies (Section 5.3) where (i) cost-effi-        that information to predict the training iteration execution
ciency leads to selecting a different GPU, and (ii) where the          time on a different GPU. Predicting the iteration execution
V100 does not offer significant performance benefits over a            time is enough (Observation 1) to compute metrics about the
common desktop-class GPU (the 2080Ti).                                 entire training process on different GPUs. These predicted
                                                                       metrics, such as the training throughput and cost-normalized
Summary. Straightforward approaches that users might con-              throughput, are then used by end-users (e.g., deep learning
sider to make GPU selections all have their own downsides.             researchers) to make informed hardware selections.
In particular, existing approaches either require access to the           To actually make these predictions for a different GPU,
GPUs themselves or are only applicable for common DNNs                 Surfer predicts the new execution time of each individual
and GPUs. Therefore there is a need for a complementary                operation in a training iteration. Surfer then adds these pre-
approach: making performance predictions—something that                dicted times together to arrive at an execution time prediction
we explore in this work.                                               for the entire iteration. For an individual operation, Surfer
                                                                       makes predictions using either (i) wave scaling (Section 3.3),
                                                                       or (ii) pre-trained MLPs (Section 3.4).
3     Surfer
                                                                          The reason why we use two techniques together is that
Our approach to performance predictions is powered by three            wave scaling assumes that the same GPU kernels are used
key observations. In this section, after describing these obser-       to implement a given DNN operation on each GPU. How-
vations, we outline the key ideas behind Surfer.                       ever, some DNN operations are implemented using different
                                                                       GPU kernels on different GPUs (e.g., convolutions, recurrent
                                                                       layers). This is done for performance reasons as these op-
3.1    Key Observations                                                erations are typically implemented using proprietary kernel
                                                                       libraries that leverage GPU architecture-specific kernels (e.g.,
Observation 1: Repetitive computation. While training a
                                                                       cuDNN [15], cuBLAS [65]). We refer to these operations as
DNN to an acceptable accuracy can take on the order of hours
                                                                       kernel-varying, and scale their execution times to different
to days [17, 40, 91], a single training iteration takes on the
                                                                       GPUs using pre-trained MLPs. Surfer uses wave scaling for
order of hundreds of milliseconds. This observation improves
                                                                       the rest of the operations, which we call kernel-alike.
the predictability of DNN training as we can characterize
the performance of an entire DNN training session using the
performance of a single iteration.                                     3.3    Wave Scaling
Observation 2: Common building blocks among DNNs.                      Wave scaling works by scaling the execution times of the
Although DNNs can consist of hundreds of operations, they              kernels used to implement a kernel-alike DNN operation. The
are built using a relatively small set of unique operations. For       computation performed by a GPU kernel is partitioned into
example, convolutional neural networks typically comprise              groups of threads called thread blocks [20], which typically

                                                                   4
execute in concurrent groups—resulting in waves of execu-                           3.4    MLP Predictors
tion. The key idea behind wave scaling is to compute the
number of thread block waves in a kernel and scale the wave                         To handle kernel-varying operations, Surfer uses pre-trained
execution time using ratios between the origin and destination                      MLPs to make execution time predictions. We treat this pre-
GPUs.                                                                               diction task as a regression problem: given a series of input
                                                                                    features about the operation and a target GPU (described
    We describe wave scaling formally in Equation 1. Let Ti
                                                                                    below), predict the operation’s execution time on that target
represent the execution time of the kernel on GPU i, B the
                                                                                    GPU. We learn an MLP for each kernel-varying operation that
number of thread blocks in the kernel, Wi the number of thread
                                                                                    Surfer currently supports: (i) convolutions (2-dimensional),
blocks in a wave on GPU i, Di the memory bandwidth on GPU
                                                                                    (ii) LSTMs [28], (iii) batched matrix multiplies, and (iv) lin-
i, and Ci the clock frequency on GPU i. Here we let i ∈ {o, d}
                                                                                    ear layers (matrix multiply with an optional bias term). As we
to represent the origin and destination GPUs. By measuring
                                                                                    show in Section 5, relatively few DNN operations are kernel-
To (Observation 3), wave scaling predicts Td using
                                                                                    varying and so training separate MLPs for each of these oper-
                                  γ         1−γ         −1                  ations is a feasible approach. Furthermore, these MLPs can
                B          Do Wd            Co            B                         be used for many different DNNs as these operations are
           Td =                                                      To   (1)
                Wd         Dd Wo            Cd            Wo                        common “building blocks” used in DNNs (Observation 2).
                                                                                    Input features. Each operation-specific MLP takes as input:
where γ ∈ [0, 1] represents the “memory bandwidth bound-
                                                                                    (i) layer dimensions (e.g., the number of input and output
edness” of the kernel. We select γ by measuring the ker-
                                                                                    channels in a convolution); (ii) the memory capacity and
nel’s arithmetic intensity and then leveraging the roofline
                                                                                    bandwidth on the target GPU; (iii) the number of streaming
model [87] (see Section 4.2).
                                                                                    multiprocessors (SMs) on the target GPU; and (iv) the peak
   As shown in Equation 1, wave scaling uses the ratios be-                         FLOPS of the target GPU, specified by the manufacturer.
tween the GPUs’ (i) memory bandwidths, (ii) clock frequen-
cies, and (iii) the size of a wave on each GPU. The intuition                       Model architecture. Each MLP comprises an input layer,
behind factors (i) and (iii) is that a higher relative memory                       eight hidden layers, and an output layer that produces a sin-
bandwidth allows more memory requests to be served in par-                          gle real number—the predicted execution time (this includes
allel whereas having more thread blocks in a wave results                           the forward and backward pass) for the MLP’s associated
in more memory requests being made. Thus, everything else                           operation. We use ReLU activation functions in each layer
held constant, waves in memory bandwidth bound kernels                              and we use 1024 units in each hidden layer. We outline the
(i.e., large γ) should see speedups on GPUs with more mem-                          details behind our datasets and how these MLPs are trained
ory bandwidth. The intuition behind factor (ii) is that higher                      in Section 4.3 and Appendix A.
clock frequencies may benefit waves in compute bound ker-
nels (i.e., small γ).3
   For large dB/Wi e (i.e., when there are a large number of
                                                                                    4     Implementation Details
waves) we get that dB/Wi e ≈ B/Wi . In this case, Equation 1
                                                                                    Surfer is built to work with PyTorch [73]. However, the ideas
simplifies to
                                                                                    behind Surfer are general and can be implemented in other
                             γ         1−γ         1−γ                        frameworks as well. Surfer performs its analysis using a
                         Do          Wo            Co                               DNN’s computation graph, which is also available in other
              Td =                                             To         (2)
                         Dd          Wd            Cd                               frameworks (e.g., TensorFlow [1] and MXNet [13]).

Surfer uses Equation 2 to predict kernel execution times be-
cause we find that in practice, most kernels are composed of                        4.1    Extracting Runtime Metadata
many thread blocks.                                                                 Surfer extracts runtime metadata in a training iteration by
   We can compute Wi for each kernel and GPU using the                              “monkey patching” PyTorch operations with special wrappers.
thread block occupancy calculator that is provided as part                          These wrappers allow Surfer to intercept and keep track of
of the CUDA Toolkit [68]. We obtain Ci from each GPU’s                              all the operations that run in one training iteration, as they
specifications, and we obtain Di by measuring the achieved                          are executed. As shown in Listing 1, users explicitly indicate
bandwidth on each GPU ahead of time. Note that we make                              to Surfer when to start and stop tracking the operations in a
these measurements once and then distribute them in a con-                          DNN by calling track().
figuration file with Surfer.
                                                                                    Execution time. To measure the execution time of each op-
   3 Theclock’s impact on execution time depends on other factors too (e.g.,
                                                                                    eration, Surfer re-runs each operation independently with the
the GPU’s instruction set architecture). Wave scaling aims to be a simple and       same inputs as recorded when the operation was intercepted.
understandable model and therefore does not model these complex effects.            Surfer also measures the execution time associated with the

                                                                                5
This means that γ decreases linearly from 1 to 0.5 as x in-
                                                                             creases toward R. After passing R, γ approaches 0 as x ap-
 Perf. (FLOP/s)

                                                               P
                                                                             proaches infinity.
                  ·x

                                                                             Practical optimizations. In practice, gathering metrics on
                  D

                                                                             GPUs is a slow process because the kernels need to be re-
                                                                     x       played multiple times to capture all the needed performance
                       x1    x2   Arithmetic Intensity (FLOP/byte)
                                                                             counters. To address this challenge, we make two optimiza-
Figure 2: An example roofline model. If a kernel’s arithmetic                tions: (i) we cache measured metrics, keyed by the kernel’s
intensity falls in the shaded region, it is considered memory                name and its launch configuration (number of thread blocks
bandwidth bound (x1 ); otherwise, it is considered compute                   and block size); and (ii) we only measure metrics for oper-
bound (x2 ).                                                                 ations that contribute significantly to the training iteration’s
                                                                             execution time (e.g., with execution times at or above the
                                                                             99.5th percentile). Consequently, when metrics are unavail-
operation’s backward pass, if applicable. Surfer uses CUDA                   able for a particular kernel, we set γ = 1. We believe that this
events [61] to make these timing measurements.                               is a reasonable approximation because kernel-alike operations
Kernel metadata. Surfer uses CUPTI [64] to record execu-                     tend to be very simple (e.g., element-wise operations) and are
tion times for the kernels used to implement each operation                  therefore usually memory bandwidth bound.
in the DNN. This information is used by wave scaling.
                                                                             4.3    MLPs: Data and Training
4.2               Selecting Gamma (γ)                                        Data collection. We gather training data by measuring the
                                                                             execution times of each operation at different configurations
Recall from Section 3.3 that wave scaling scales its ratios us-              on all six of the GPUs listed in Section 5.1. For example,
ing γ, a factor that represents the “memory bandwidth bound-                 for 2D convolutions, we vary the (i) batch size, (ii) number
edness” of a kernel. In this section, we describe in more detail             of input and output channels, (iii) kernel size, (iv) padding,
how Surfer automatically selects γ.                                          (v) stride, and (vi) image size. We select configurations ran-
Roofline model. Wave scaling uses the roofline model [87]                    domly out of the space of all possible configurations. We
to estimate a kernel’s memory boundedness, which it then                     create the final dataset by joining data entries that have the
maps to a value γ ∈ [0, 1]. Figure 2 shows an example roofline               same operation and configuration, but with different GPUs.
model.                                                                       Training. We implement our MLPs using PyTorch. We train
   One key idea behind the roofline model is the notion of                   each MLP for 80 epochs using the Adam optimizer [37] with
a kernel’s arithmetic intensity: the number of floating point                a learning rate of 5 × 10−4 , weight decay of 10−4 , and a batch
operations it performs per byte of data read or written to mem-              size of 512 samples. We reduce the learning rate to 10−4 after
ory (represented by x in Figure 2). The roofline model models                40 epochs. We use the mean absolute percentage error as our
a kernel’s peak performance as the minimum of either the                     loss function:
hardware’s peak performance (P) or the hardware’s memory
bandwidth times the kernel’s arithmetic intensity (D · x) [87].                                 1 n predictedi − measuredi
                                                                                           L=     ∑
This minimum is shown by the solid line in Figure 2.                                            n i=1     measuredi
   A direct consequence of this model is that it considers a
                                                                             We split our datasets by assigning 80% of our samples to the
kernel with an arithmetic intensity of x to be memory bound
                                                                             training set and the rest to our test set. Any configurations that
if x < P/D and compute bound otherwise. For example, in
                                                                             we test on in Section 5 do not appear in our training sets. We
Figure 2, a kernel with an arithmetic intensity of x1 would be
                                                                             normalize the inputs by subtracting the mean and dividing by
considered memory bandwidth bound whereas a kernel with
                                                                             the standard deviation of the input features in our training set
an intensity of x2 would be considered compute bound.
                                                                             (see Appendix A for complete details).
Selecting γ. When profiling each kernel, wave scaling gath-
ers metrics that allow it to empirically calculate the kernel’s
                                                                             5     Evaluation
arithmetic intensity (floating point efficiency, number of bytes
read and written to DRAM). If we let x be the kernel’s mea-                  Surfer is meant to be used by deep learning researchers and
sured arithmetic intensity and R = P/D (using the notation                   practitioners to predict the potential compute performance of
above), we set γ using                                                       a given GPU so that they can make informed cost-efficient
                   (                                                         choices when selecting GPUs for training. Consequently, in
                     (−0.5/R)x + 1 if x < R                                  our evaluation our goals are to determine (i) how accurately
              γ=                                             (3)
                     0.5R/x              otherwise                           Surfer can predict the training iteration execution time on

                                                                         6
Table 1: The GPUs we use in our evaluation.                              Table 3: The DNNs and training configurations we use.

 GPU          Generation   Mem.     Mem. Type       SMs    Rental Cost4           Application       Model               Arch. Type    Dataset

 P4000 [52]                8 GB     GDDR5 [43] 14          –                      Image Classif.    ResNet-50 [27]      Convolution   ImageNet [78]
            Pascal [50]
 P100 [49]                 16 GB    HBM2 [4]   56          $1.46/hr                                 Inception v3 [83]

 V100 [53] Volta [54]      16 GB    HBM2            80     $2.48/hr               Machine Transl.   GNMT [88]           Recurrent     WMT’16 [9]
                                                                                                    Transformer [85]    Attention       (EN-DE)
 2070 [57]                 8 GB  GDDR6 [44] 36             –
 2080Ti [58] Turing [60]   11 GB GDDR6      68             –                      Image Gen.        DCGAN [76]          Convolution   LSUN [90]
 T4 [59]                   16 GB GDDR6      40             $0.35/hr

       Table 2: The machines we use in our evaluation.                           used—to show how Surfer can make predictions for a lower
                                                                                 bound on the computational performance.
 CPU                           Freq.       Cores     Mem.        GPU             Metrics. In our experiments, we measure and predict the
 Xeon E5-2680 v4 [30]          2.4 GHz     14        128 GB      P4000           training iteration execution time—the wall clock time it takes
 Ryzen TR 1950X [5]            3.4 GHz     16        16 GB       2070            to perform one training step on a batch of inputs. We use
 EPYC 7371 [6]                 3.1 GHz     16        128 GB      2080Ti          the training iteration execution time to compute the training
                                                                                 throughput and cost-normalized throughput for our analysis.
                                                                                 The training throughput is the batch size divided by the iter-
GPUs with different architectures, and (ii) whether Surfer can                   ation execution time. The cost-normalized throughput is the
correctly predict the relative cost-efficiency of different GPUs                 throughput divided by the hourly cost of renting the hardware.
when used to train a given model. Overall, we find that Surfer                   Measurements. We use CUDA events to measure the exe-
makes iteration execution time predictions across pairs of six                   cution time of training iterations and DNN operations. We
different GPUs with an average error of 11.8% on ResNet-                         run 3 warm up repetitions, which we discard, and then record
50 [27], Inception v3 [83], the Transformer [85], GNMT [88],                     the average execution time over 3 further repetitions. We use
and DCGAN [76].                                                                  CUPTI [64] to measure a kernel’s execution time.

5.1       Methodology                                                            5.2     How Accurate are Surfer’s Predictions?
Hardware. In our experiments, we use the GPUs listed in                          To evaluate Surfer’s prediction accuracy, we use it to make
Table 1. For the P4000, 2070, and 2080Ti we use machines                         training iteration execution time predictions for ResNet-50,
whose configurations are listed in Table 2. For the T4 and                       Inception v3, the Transformer, GNMT, and DCGAN on all six
V100, we use g4dn.xlarge and p3.2xlarge instances on                             GPUs listed in Section 5.1. Recall that Surfer makes execution
AWS respectively [7]. For the P100, we use Google Cloud’s                        time predictions by scaling the execution time measured on
n1-standard instances [22] with 4 vCPUs and 15 GB of                             one GPU (the “origin” GPU) to another (the “destination”
system memory.                                                                   GPU). As a result, we use all 30 possible (origin, destination)
Runtime environment. We run our experiments inside                               pairs of these six GPUs in our evaluation.
Docker containers [19]. Our container image uses Ubuntu
18.04 [11], CUDA 10.1 [68], and cuDNN 7 [62]. On cloud                           5.2.1   End-to-End Prediction Accuracy
instances, we use the NVIDIA GPU Cloud Image, version
20.06.3 [71]. We use PyTorch 1.4.0 [73] for all experiments.                     Figure 3 shows Surfer’s prediction errors for these afore-
                                                                                 mentioned end-to-end predictions. Each subfigure shows the
Models and datasets. We evaluate Surfer by predicting the                        predictions for all five models on a specific destination GPU.
training iteration execution time for the models listed in Ta-                   We make predictions for three different batch sizes (shown on
ble 3 on different GPUs. For ResNet-50 and Inception v3 we                       the figures) and plot both the predicted and measured iteration
use stochastic gradient descent [10]. We use Adam [37] for                       execution times. Since we consider all possible pairs of our
the rest of the models. We use synthetic data (sampled from                      six GPUs, for each destination GPU we plot the average pre-
a normal distribution) of the same size as samples from each                     dicted execution times among the five origin GPUs. Similarly,
dataset.5 For the machine translation models, we use a fixed                     we show the average prediction error above each bar. From
sequence length of 50—the longest sentence length typically                      these figures, we can draw three major conclusions.
   4 Google  Cloud pricing in us-central1, as of January 2021.
                                                                                    First, Surfer makes accurate end-to-end iteration execution
   5 We verified that the training computation time does not depend on the       time predictions since the average prediction error across all
values of the data itself.                                                       GPUs and models is 11.8%. The average prediction error

                                                                             7
Measured              Predicted

                                                                                                                  Iter. Exec. Time (ms)
 Iter. Exec. Time (ms)

                                                             12.3%
                         400

                                                                                                                                                                            8.9%

                                                                                                                                                                                                    11.7%
                                                                                                                                          400

                                                                                       6.6%

                                                                                                                                                                                        20.9%
                                                                                                                                                       10.3%
                                                                         9.1%

                                                                                                                                                                                                  13.0%
                                                                                                                                                                                     17.8%
                                      9.8%

                                                                       14.4%

                                                                                    8.7%
                                                                        7.7%

                                                                                   10.1%

                                                                                                                                                                                    18.4%

                                                                                                                                                                                                14.3%
                                                     9.3%

                                                                                                                                                                     8.9%
                                  10.1%

                                                  12.0%

                                                                                                                                                                  19.7%
                         200

                                                                                                                                                   9.1%
                                                                                                                                                 10.1%

                                                                                                                                                                                                             13.7%
                                                                                                                                          200

                                                                                                                                                                                                            12.1%
                                8.0%

                                                                                                  11.5%

                                                                                                                                                                                                            20.5%
                                                                                                   7.9%
                                                                                                  7.8%
                           0                                                                                                                0
                                 16 32 64 16 32 64 32 48 64 16 32 48                              64 96128                                        16 32 64 16 32 64 32 48 64 16 32 48                       64 96 128
                                ResNet-50 Inception v3 Transformer GNMT                           DCGAN                                          ResNet-50 Inception v3 Transformer GNMT                    DCGAN
                                                   Model and Batch Size                                                                                             Model and Batch Size
                                                 (a) Predictions onto the V100                                                                                 (b) Predictions onto the 2080Ti
 Iter. Exec. Time (ms)

                                                               14.9%

                                                                                                                  Iter. Exec. Time (ms)
                                                                                                                                          750

                                                                                          16.3%

                                                                                                                                                                            8.3%
                                         26.8%

                         1000

                                                                                                                                                       11.9%

                                                                                                                                                                                                     6.7%
                                                                                       20.9%

                                                                                                                                                                                       10.0%
                                                                           9.4%

                                                                                                                                          500
                                                       13.6%

                                                                                                                                                                                                  8.7%
                                                                                    11.7%

                                                                                                                                                                     5.5%

                                                                                                                                                                                     7.1%
                                    26.1%

                                                                        7.7%

                                                                                                                                                                                    10.5%
                                                                                                                                                   10.4%
                                                                       10.7%

                                                                                                                                                                                                8.6%
                                 19.4%

                                                                                                                                                                  7.4%
                                                    5.7%

                          500

                                                                                                                                                 8.5%

                                                                                                                                                                                                             6.0%
                                                                                                   6.9%

                                                                                                                                                                                                            9.1%
                                                                                                                                          250

                                                                                                  6.9%

                                                                                                                                                                                                            9.6%
                            0                                                                     7.5%                                      0
                                 16 32 64 16 32 64 32 48 64 16 32 48                               64 96128                                       16 32 64 16 32 64 32 48 64 16 32 48                       64 96 128
                                ResNet-50 Inception v3 Transformer GNMT                            DCGAN                                         ResNet-50 Inception v3 Transformer GNMT                    DCGAN
                                                   Model and Batch Size                                                                                             Model and Batch Size
                                                  (c) Predictions onto the T4                                                                                   (d) Predictions onto the 2070
                                                                                                                                          1500
 Iter. Exec. Time (ms)

                                                                                                                  Iter. Exec. Time (ms)
                                                             9.6%

                                                                                                                                                                             7.0%
                         600
                                                                                       6.6%

                                                                                                                                                        15.2%

                                                                                                                                                                                                    6.6%
                                                                           11.0%

                                                                                    10.3%

                                                                                                                                                                                        18.6%
                                      9.7%

                                                                                                                                                                                                 15.1%
                                                                                                                                          1000
                                                                                   13.2%

                                                                                                                                                                                                12.7%
                                                                                                                                                                                     17.4%
                                                                         9.8%
                                                      5.3%

                         400

                                                                                                                                                                      6.3%
                                                                                                                                                    16.1%
                                  11.3%

                                                                                                                                                                                    19.6%
                                                                       6.7%

                                                                                                                                                                   10.2%
                                                   6.0%
                                11.4%

                                                                                                                                                  17.4%

                                                                                                                                                                                                             21.2%
                                                                                                   10.0%

                                                                                                                                                                                                            22.6%
                                                                                                  11.6%

                                                                                                                                                                                                            29.8%
                                                                                                                                           500
                                                                                                  6.0%

                         200
                           0                                                                                                                0
                                 16 32 64 16 32 64 32 48 64 16 32 48                              64 96 128                                       16 32 64 16 32 64 32 48 64 16 32 48                       64 96128
                                ResNet-50 Inception v3 Transformer GNMT                           DCGAN                                          ResNet-50 Inception v3 Transformer GNMT                    DCGAN
                                                   Model and Batch Size                                                                                             Model and Batch Size
                                                 (e) Predictions onto the P100                                                                                 (f) Predictions onto the P4000

                                      Figure 3: Iteration execution time predictions averaged across all other “origin” GPUs we evaluate.

across all ResNet-50, Inception v3, Transformer, GNMT, and                                                               5.2.2                   Operation Breakdown
DCGAN configurations are 13.4%, 9.5%, 12.6%, 11.2%, and
12.3% respectively.                                                                                                      Figure 4 shows a breakdown of the prediction errors for the
                                                                                                                         execution time of individual operations, which are listed on
  Second, Surfer can predict the iteration execution time                                                                the x-axis. The operations predicted using the MLP predictors
across GPU generations, which have different architectures,                                                              are shown on the left (conv2d, lstm, bmm, and linear). Wave
and across classes of GPUs. The GPUs we use span three                                                                   scaling is used to predict the rest of the operations. Above
generations (Pascal [50], Volta [54], and Turing [60]) and                                                               each bar, we also show the importance of each operation as
include desktop, professional workstation, and server-class                                                              a percentage of the iteration execution time, averaged across
GPUs.                                                                                                                    all three DNNs. The prediction errors are averaged among all
   Third, Surfer is general since it supports different types of                                                         pairs of the six GPUs that we evaluate and among ResNet-50,
DNN architectures. Surfer works with convolutional neural                                                                Inception v3, the Transformer, GNMT, and DCGAN. From
networks (e.g., ResNet-50, Inception v3, DCGAN), recurrent                                                               this figure, we can draw two major conclusions.
neural networks (e.g., GNMT), and other neural network ar-                                                                  First, MLP predictors can be used to make accurate predic-
chitectures such as the attention-based Transformer. In partic-                                                          tions for kernel-varying operations as the average error among
ular, Surfer makes accurate predictions for ResNet, Inception,                                                           the conv2d, lstm, bmm, and linear operations is 18.0%. Sec-
and DCGAN despite the significant differences in their archi-                                                            ond, wave scaling can make accurate predictions for important
tectures; ResNet has a “straight-line” computational graph,                                                              operations; the average error for wave scaling predictions is
Inception has a large “fanout” in its graph, and DCGAN is a                                                              29.8%. Although wave scaling’s predictions for some opera-
generative-adversarial model.                                                                                            tions (e.g., __add__, scatter) have high errors, these opera-

                                                                                                              8
200 The percentage above each bar is the operation's relative importance.

                                                                                                               0.3%
  Error (%)

                                                                                                                             0.2%
              100
                           25.8%

                            2.4%

                            1.1%
                          24.4%

                                                                                                                          0.1%
                            9.9%

                                                                                                                         0.2%

                                                                                                                         0.2%
                           0.6%
                           1.3%
                           4.6%

                                                                                                                         0.2%
                                                                                                                         0.2%

                                                                                                                        0.1%
                           8.2%

                           1.1%

                                                                                                                        0.2%
                          1.0%

                          0.8%

                                                                                                                        0.3%
                          0.3%
                          0.3%

                                                                                                                        0.2%
                          0.7%
                          0.8%
                          2.8%

                          0.3%
                          0.6%
                          3.2%

                         0.5%
                         0.6%

                                                                                                                       0.2%
                         1.0%
                         1.4%

                                                                                                                       0.1%
                        0.7%
                        0.8%
                        1.1%

                                                                                                                      0.2%
                0
                           conv2d
                               lstm
                           conv1d
                          convt2d
                            bnorm
                             linear
                          matmul
                          avgpl2d
                                 relu
                              mul_
                             lnorm
                                iadd
                         maxpl2d
                        leakyrelu
                               add_
                                 mul
                           truediv
                          addcdiv
                         addcmul
                         lgsftmax
                                 sqrt
                              norm
                              bmm
                              zero_
                           mskfill
                               imul
                       bxentropy
                         __add__
                                  cat
                            scatter
                                rsub
                           expand
                          dropout
                            contig
                                tanh
                            embed
                          permute
                              mean
                          softmax
                               view
                          zroslike
                                                                              Operation
Figure 4: Operation execution time prediction errors, with importance on top of each bar, averaged across all pairs of evaluated
GPUs and models. The operation names have been shortened and we only show operations with an importance of at least 0.1%.

tions do not make up a significant proportion of the training                                               conv2d                         linear
iteration execution time (having an overall importance of at                                                                  0.13               2         6
most 0.3%).                                                                                     0.14                                             3         7
                                                                                                                                                 4         8

                                                                                   Test Error
                                                                                                0.12                          0.12               5
5.2.3           Mixed Precision Training
                                                                                                0.10                          0.11
In this work, we focus on making accurate cross-GPU ex-
ecution time predictions. As a result, we treat mixed preci-
                                                                                                0.08                          0.10
sion training [42] performance predictions as an orthogonal
problem that can be addressed using existing techniques. For                                                  lstm                         bmm
                                                                                                0.08
example, in a recent work called Daydream, Zhu et al. present                                                                0.080
a technique for predicting the performance benefits of switch-
                                                                                                0.07
                                                                                   Test Error

ing from full to mixed precision training on the same fixed                                                                  0.075
GPU [92]. If users want to know about the performance bene-                                                                  0.070
fits of mixed precision training on a different GPU, they can                                   0.06
use the Daydream techniques in conjunction with Surfer.                                                                      0.065
   To show that this combined approach works in practice,                                       0.05
                                                                                                       26      28      210           26      28      210
we use a P4000 to predict the execution time of a ResNet-50                                                 Layer Size                    Layer Size
mixed precision training iteration on the 2070 and 2080Ti.6                       Figure 5: Test error as we vary the number of layers and their
On the P4000, we first use Surfer to predict the full precision                   sizes in each MLP. The x-axis is in a logarithmic scale.
iteration execution time on the 2070 and 2080Ti. Then, we
apply the Daydream techniques to translate these predicted
full precision execution times into mixed precision execution                     study where we vary the number of hidden layers in each
times. We also repeat this experiment between the 2070 and                        MLP (2 to 8) along with their size (powers of two: 25 to 211 ).
2080Ti. Overall, we find that this combined approach has an                       Figure 5 shows each MLP’s test mean absolute percentage
average error of 16.1% among the P4000, 2070, and 2080Ti                          error after being trained for 80 epochs. From this figure we
(some of this error comes from the Daydream techniques [92]).                     can draw two major conclusions.
Therefore, from these results, we can conclude that Surfer is                        First, increasing the number of layers and their sizes leads
also able to effectively support mixed precision predictions                      to lower test errors. Increasing the size of each layer beyond
on other GPUs.                                                                    29 seems to lead to diminishing returns on each operation.
                                                                                  Second, the MLPs for all four operations appear to follow a
5.2.4           MLPs: How Many Layers?                                            similar test error trend. Based on these results, we can also
                                                                                  conclude that using eight hidden layers is a reasonable choice.
In all our MLPs, we use eight hidden layers, each of size
1024. To better understand how the number of layers affects
the MLPs’ prediction accuracy, we also conduct a sensitivity                      5.3              Can Surfer Help Users Make Correct De-
   6 We use the same experimental setup and batch sizes as described and                           cisions?
shown in Section 5.1 and Figure 3. We compare our iteration execution time
predictions against training iterations performed using PyTorch’s automatic       One of Surfer’s primary use cases is to help deep learning
mixed precision module.                                                           users make informed and cost-efficient GPU selections. In the

                                                                              9
following two case studies, we demonstrate how Surfer can
                                                                                                               Measured                Predicted

                                                                                                                                                         .9%
                                                                                                 5

                                                                         Norm. Throughput
make cost-efficiency predictions that empower users to make

                                                                                                                                                                 .3%
                                                                                                                                                         14

                                                                                                                                                                         %
                                                                                                 4

                                                                                                                                                                        0.4
correct selections according to their needs.

                                                                                                                                                                 13
                                                                                                         .0%
                                                                                                 3

                                                                                                                 %

                                                                                                                           %
                                                                                                     19

                                                                                                                8.0

                                                                                                                          1.4

                                                                                                                                          .8%
                                                                                                                                  %

                                                                                                                                                    %
                                                                                                 2

                                                                                                                                 2.0

                                                                                                                                                   3.5
                                                                                                                                         33
5.3.1   Case Study 1: Should I Rent a Cloud GPU?                                                 1
                                                                                                 0
As mentioned in Section 1, one scenario a deep learning user                                             16      32        48     16    32     48         16      32   48
                                                                                                                P100                    T4                        V100
may face is deciding whether to rent GPUs in the cloud for                                                                       GPU and Batch Size
training or to stick with a GPU they already have locally (e.g.,
in their desktop). For example, suppose a user has a P4000 in                                    (a) GNMT training throughput normalized to the P4000
their workstation and they want to decide whether to rent a

                                                                         Cost Norm. Throughput
                                                                                                 300

                                                                                                                                            .8%
P100, T4, or V100 in the cloud to train GNMT.

                                                                                                                                                     %
                                                                                                                                                   3.5
                                                                                                                                          33
   With Surfer, they can use their P4000 to make predictions                                     200

                                                                                                                                   %
                                                                                                                                  2.0
about the computational performance of each cloud GPU

                                                                                                                                                                  .3%

                                                                                                                                                                         %
                                                                                                                      %

                                                                                                                            %

                                                                                                                                                                        0.4
                                                                                                                 8.0

                                                                                                                           1.4

                                                                                                                                                           .9%
                                                                                                          .0%

                                                                                                                                                                 13
to help them make this decision in an informed way. Fig-                                         100

                                                                                                                                                         14
                                                                                                         19
ure 6a shows Surfer’s throughput predictions for GNMT for
the P100, T4, and V100 normalized to the training throughput                                         0
                                                                                                           16     32        48     16    32     48         16      32  48
on the P4000. Additionally, Figure 6b shows Surfer’s pre-                                                        P100                    T4                       V100
                                                                                                                                  GPU and Batch Size
dicted training throughputs normalized by each cloud GPU’s
rental costs on Google Cloud as shown in Table 1. Note that                                                     (b) GNMT cost normalized throughput
(i) we make all these predictions with the P4000 as the ori-
gin device, (ii) we make our ground truth measurements on               Figure 6: Surfer’s GNMT training throughput predictions for
Google Cloud instances, and (iii) one can also use Surfer for a         cloud GPUs, made using a P4000. The percentage error is
similar analysis for other cloud providers. From these results,         shown above each prediction.
the user can make two observations.
   First, both the P100 and V100 offer training throughput
speedups over the P4000 (up to 2.3× and 4.0× respectively)              and expensive GPU available in the cloud to rent.7 In this case
whereas the T4 offers marginal throughput speedups (up to               study, we show how Surfer can help a user recognize when
1.4×). However, second, the user would also discover that               the V100 does not offer significant performance benefits for
the T4 is more cost-efficient to rent when compared to the              their model.
P100 and V100 as it has a higher cost-normalized throughput.               Suppose a user wants to train DCGAN and already has a
Therefore, if the user wanted to optimize for maximum com-              2080Ti that they can use. They want to find out if they should
putational performance, they would likely choose the V100.              use a different GPU to get better computational performance
But if they were not critically constrained by time and wanted          (training throughput). They can use Surfer to predict the
to optimize for cost, sticking with the P4000 or renting a T4           training throughput on other GPUs. Figure 7 shows Surfer’s
would be a better choice.                                               throughput predictions along with the measured throughput,
   Surfer makes these predictions accurately, with an average           normalized to the 2080Ti’s training throughput. Note that we
error of 10.7%. We also note that despite any prediction errors,        use a batch size of 64 as it is the default batch size in the
Surfer still correctly predicts the relative ordering of these          DCGAN reference implementation [16] and 128 because it is
three GPUs in terms of their throughput and cost-normalized             the size reported by the authors in their paper [76].
throughput. For example, in Figure 6b, Surfer correctly pre-               From this figure, the user would conclude that they should
dicts that the T4 offers the best cost-normalized throughput on         stick to using their 2080Ti as the V100 would not be worth
all three batch sizes. These predictions therefore allow users          renting. The V100 offers marginal throughput improvements
to make correct decisions based on their needs (optimizing              over the 2080Ti (1.1×) while the P100, P4000, 2070, and T4
for cost or pure performance).                                          all do not offer throughput improvements at all. The reason the
                                                                        V100 does not offer any significant benefits over the 2080Ti
                                                                        despite having more computational resources (Table 1) is that
5.3.2   Case Study 2: Is the V100 Always Better?                        DCGAN is a “computationally lighter” model compared to
                                                                        GNMT and so it does not really benefit from a more power-
In the previous case study, Surfer correctly predicts that the          ful GPU. Surfer makes these predictions accurately, with an
V100 provides the best performance despite not being the                average error of 7.7%.
most cost-efficient to rent. This conclusion may lead a naïve
user to believe that the V100 always provides better training                       7 This
                                                                                  is true except for the new A100s, which have only recently become
throughput over other GPUs, given that it is the most advanced          publicly available in the cloud.

                                                                   10
to implement kernel-varying operations (e.g., convolution).
                   1.5       Measured            Predicted

                                                                         3.1%
                                                                                      However, an operation’s execution time is not determined by

                                                                        8.7%
Norm. Throughput

                          11.4%   Throughput on the 2080Ti                            only its number of FLOPs, and using heuristics to select an
                         14.3%
                   1.0

                                                  3.7%
                                                  0.1%
                                                                                      analytical model cannot always capture kernel-varying op-
                                    17.9%
                                    12.9%

                                                             0.7%
                                                             1.5%
                   0.5                                                                erations correctly. This is because proprietary closed-source
                                                                                      kernel libraries (e.g., cuDNN [15, 62], cuBLAS [65]) may
                   0.0                                                                select different kernel(s) to use by running benchmarks on
                         64 128      64 128      64 128        64 128   64 128        the target GPU [33, 63].
                          P100        P4000       2070           T4     V100
                                            GPU and Batch Size
                                                                                      Performance models for compilers. A complementary
  Figure 7: Predicted and measured DCGAN training through-                            body of work on performance modeling is motivated by the
  put normalized to the 2080Ti, with prediction errors above                          needs of compilers: predicting how different implementations
  each bar. Surfer correctly predicts that the V100’s perfor-                         of some functionality perform on the same hardware. These
  mance is not significantly better than the 2080Ti.                                  models were developed to aid in compiling high-performance
                                                                                      (i) graphics pipelines [2], (ii) CPU code [41], and (iii) tensor
                                                                                      operators for deep learning accelerators [14]. These models
 Summary. These case studies show examples of situations                              have fundamentally different goals compared to Surfer, which
 where (i) the GPU offering the highest training throughput is                        is a technique that predicts the performance of different GPUs
 not the same as the most cost-efficient GPU, and where (ii) the                      running the same high-level code.
 V100 does not offer significantly better performance when
 compared to a desktop-class GPU (the 2080Ti). Notably, in                            Repetitiveness of DNN training. Prior work leverages the
 both case studies, Surfer correctly predicts each of these find-                     repetitiveness of DNN training computation to optimize dis-
 ings. As a result, deep learning researchers and practitioners                       tributed training [32,36,47], schedule jobs in a cluster [48,89],
 can rely on Surfer to help them make correct cost-efficient                          and to apply DNN compiler optimizations [80]. The key dif-
 GPU selections according to their needs.                                             ference between these works and Surfer is that they apply
                                                                                      optimizations on the same hardware configuration. Surfer ex-
                                                                                      ploits the repetitiveness of DNN training to make performance
  6                  Related Work                                                     predictions on different hardware configurations.
 The key difference between Surfer and existing DNN per-                              DNN benchmarking. A body of prior work focuses on
 formance modeling techniques for GPUs [35, 74, 75] is in                             benchmarking DNN training [3,17,40,91]. While these works
 how Surfer makes execution time predictions. Surfer takes a                          provide DNN training performance insights, they do so only
 hybrid runtime-based approach; it uses information recorded                          for a fixed set of DNNs and hardware configurations. In con-
 at runtime on one GPU along with hardware characteristics                            trast, Surfer analyzes DNNs in general and provides perfor-
 to scale the measured kernel execution times onto different                          mance predictions on different GPUs to help users make
 GPUs through either (i) wave scaling, or (ii) pre-trained MLPs.                      informed GPU selections.
 In contrast, existing techniques use analytical models [74, 75]
 or rely entirely on machine learning techniques [35]. The key
 advantage of Surfer’s hybrid scaling approach is that wave                           7   Conclusion
 scaling works “out of the box” for all kernel-alike operations
 (i.e. operations implemented using the same kernels on differ-
                                                                                      We present Surfer: a new runtime-based library that uses
 ent GPUs). Ultimately, this advantage means that new analyt-
                                                                                      wave scaling and MLPs as execution time predictors to help
 ical or machine learning models do not have to be developed
                                                                                      deep learning researchers and practitioners make informed
 each time a new kernel-alike operation is introduced.
                                                                                      cost-efficient decisions when selecting a GPU for DNN train-
 DNN performance models for different hardware. There                                 ing. The key idea behind Surfer is to leverage information
 exists prior work on performance models for DNN training                             collected at runtime on one GPU to help predict the execu-
 on both GPUs [35, 74, 75] and CPUs [86], though only the                             tion time of a DNN training iteration on a different GPU. We
 works by Qi et al. and Justus et al. seem to support generic                         evaluate Surfer and find that it makes cross-GPU iteration exe-
 DNNs. As described above, Surfer is fundamentally different                          cution time predictions with an overall average error of 11.8%
 from these works because it takes a hybrid runtime-based                             on ResNet-50, Inception v3, the Transformer, GNMT, and
 approach when making execution time predictions. For ex-                             DCGAN. Finally, we present two case studies where Surfer
 ample in comparison, Paleo [75] (i) makes DNN operation                              correctly predicts that (i) optimizing for cost-efficiency would
 execution time predictions using analytical models based on                          lead to selecting a different GPU for GNMT, and (ii) that the
 the number of floating point operations (FLOPs) in a DNN                             V100 does not offer significant performance benefits over a
 operation, and (ii) uses heuristics to select the kernels used                       common desktop-class GPU (the 2080Ti) for DCGAN.

                                                                                 11
Acknowledgments                                                          [6] Advanced Micro Devices, Inc. AMD EPYC™ 7371 Pro-
                                                                             cessor, 2020. https://www.amd.com/en/products/
We are grateful to the many people who have contributed to                   cpu/amd-epyc-7371.
this work, either through informal discussions and/or by pro-
viding feedback on earlier versions of this paper. In particular         [7] Amazon, Inc. Amazon EC2 Instance Types, 2020.
we thank (in alphabetical order) Moshe Gabel, James Glee-                    https://aws.amazon.com/ec2/instance-types/.
son, Anand Jayarajan, Xiaodan Tan, Alexandra Tsvetkova,
Shang Wang, Qiongsi Wu, and Hongyu Zhu. We also thank                    [8] Amazon, Inc. Amazon SageMaker, 2021. https://
all members of the EcoSystem research group for the stim-                    aws.amazon.com/sagemaker/.
ulating research environment they provide. This work was
                                                                         [9] Ondřej Bojar, Rajen Chatterjee, Christian Federmann,
supported by a Queen Elizabeth II Graduate Scholarship in
                                                                             Yvette Graham, Barry Haddow, Matthias Huck, Anto-
Science and Technology, Vector Scholarship in Artificial Intel-
                                                                             nio Jimeno Yepes, Philipp Koehn, Varvara Logacheva,
ligence, Snap Research Scholarship, and an NSERC Canada
                                                                             Christof Monz, Matteo Negri, Aurelie Neveol, Mari-
Graduate Scholarship – Master’s (CGS M). This work was
                                                                             ana Neves, Martin Popel, Matt Post, Raphael Rubino,
also supported in part by the NSERC Discovery grant, the
                                                                             Carolina Scarton, Lucia Specia, Marco Turchi, Karin
Canada Foundation for Innovation JELF grant, the Connaught
                                                                             Verspoor, and Marcos Zampieri. Findings of the 2016
Fund, and Huawei grants. Computing resources used in this
                                                                             conference on machine translation. In Proceedings of
work were provided, in part, by the Province of Ontario, the
                                                                             the First Conference on Machine Translation (WMT’16),
Government of Canada through CIFAR, and companies spon-
                                                                             2016.
soring the Vector Institute www.vectorinstitute.ai/partners.
                                                                        [10] Léon Bottou. Large-scale machine learning with
References                                                                   stochastic gradient descent. In Proceedings of the 19th
                                                                             International Conference on Computational Statistics
 [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng                        (COMPSTAT’10), 2010.
     Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, San-
     jay Ghemawat, Geoffrey Irving, Michael Isard, Man-                 [11] Canonical Ltd. Ubuntu 18.04 LTS (Bionic Beaver),
     junath Kudlur, Josh Levenberg, Rajat Monga, Sherry                      2018. http://releases.ubuntu.com/18.04.
     Moore, Derek G. Murray, Benoit Steiner, Paul Tucker,               [12] Cerebras. Cerebras, 2020. https://www.cerebras.
     Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu,                    net.
     and Xiaoqiang Zheng. Tensorflow: A system for large-
     scale machine learning. In Proceedings of the 12th                 [13] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang,
     USENIX Symposium on Operating Systems Design and                        Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang,
     Implementation (OSDI’16), 2016.                                         and Zheng Zhang. MXNet: A flexible and efficient
 [2] Andrew Adams, Karima Ma, Luke Anderson, Riyadh                          machine learning library for heterogeneous distributed
     Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner,                   systems. In Proceedings of the 2016 NeurIPS Workshop
     Steven Johnson, Kayvon Fatahalian, Frédo Durand, and                    on Machine Learning Systems, 2016.
     Jonathan Ragan-Kelley. Learning to Optimize Halide                 [14] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang,
     with Tree Search and Random Programs. ACM Trans-                        Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind
     actions on Graphics (TOG), 38(4), 2019.                                 Krishnamurthy. Learning to Optimize Tensor Programs.
 [3] Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon                      In Advances in Neural Information Processing Systems
     Wei, and David Brooks. Fathom: Reference workloads                      31 (NeurIPS’18), 2018.
     for modern deep learning methods. In Proceedings of
     the 2016 IEEE International Symposium on Workload                  [15] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch,
     Characterization (IISWC’16), 2016.                                      Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan
                                                                             Shelhamer. cuDNN: Efficient primitives for deep learn-
 [4] Advanced Micro Devices, Inc.        HBM2 -                              ing. CoRR, abs/1410.0759, 2014.
     High Bandwidth Memory-2, 2015.        https:
     //www.amd.com/system/files/documents/high-                         [16] Soumith Chintala. Deep Convolution Generative
     bandwidth-memory-hbm.pdf.                                               Adversarial Networks, 2020. https://github.com/
                                                                             pytorch/examples/tree/master/dcgan.
 [5] Advanced Micro Devices, Inc.     AMD Ryzen
     Threadripper 1950X Processor, 2017.   https:                       [17] Cody Coleman, Deepak Narayanan, Daniel Kang, Tian
     //www.amd.com/en/products/cpu/amd-ryzen-                                Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle
     threadripper-1950x.                                                     Olukotun, Chris Ré, and Matei Zaharia. DAWNBench:

                                                                   12
You can also read