Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning


Aurick Qiao^{1,2}, Willie Neiswanger^{1,2}, Qirong Ho^{2}, Hao Zhang^{1,2}, Gregory R. Ganger^{1}, and Eric P. Xing^{1,2}
^{1} Carnegie Mellon University
^{2} Petuum, Inc.
arXiv:2008.12260v1 [cs.DC] 27 Aug 2020

Abstract

Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers will assign each job a number of resources requested by the user, which can allow jobs to use those resources inefficiently. Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize those resources.

Pollux simultaneously considers both aspects. By observing each job during training, Pollux models how their goodput (system throughput combined with statistical efficiency) would change by adding or removing resources. Leveraging these models, Pollux dynamically (re-)assigns resources to maximize cluster-wide goodput, while continually optimizing each DL job to better utilize those resources.

In experiments with real DL training jobs and with trace-driven simulations, Pollux reduces average job completion time by 25%−50% relative to state-of-the-art DL schedulers, even when all jobs are submitted with ideal resource and training configurations. Based on the observation that the statistical efficiency of DL training can change over time, we also show that Pollux can reduce the cost of training large models in cloud environments by 25%.

1 Introduction

Deep learning (DL) training has rapidly become a dominant workload in many shared resource environments such as datacenters and the cloud. DL jobs are resource-intensive and long-running, demanding distributed execution using expensive hardware devices (e.g. GPUs or TPUs) in order to complete within reasonable amounts of time. To meet this resource demand, dedicated clusters are often provisioned for deep learning alone [24], with a scheduler that mediates resource sharing between many competing DL jobs.

However, existing schedulers require users submitting jobs to also specify training parameters that, if set incorrectly, can greatly degrade job performance and resource efficiency. Of these training parameters, the batch size and learning rate of a DL job are strongly dependent on its allocation of resources, making them particularly difficult to decide in advance in shared-resource environments. Furthermore, an allocation of resources that can be efficiently utilized by a DL job not only depends on the structure of the model being trained, but also on the batch size and learning rate. This co-dependence between the resources, batch size, and learning rate creates a complex web of considerations a user must make in order to configure their job for efficient execution and resource utilization.

Fundamentally, an efficiently configured DL job strikes a balance between two often opposing desires: (1) system throughput, the number of training examples processed per unit of wall-clock time, and (2) statistical efficiency, the amount of progress made per training example processed.

System throughput can be increased by increasing the batch size, as illustrated in Fig. 1a. A larger batch size enables higher utilization of more resources (e.g. a larger number of GPUs). However, when the batch size is increased, the learning rate must be re-tuned. Otherwise, statistical efficiency will decrease so that the total training time will not be any shorter, wasting the additionally allocated GPUs.

Figure 1: Trade-offs between the batch size, resource scalability, and stage of training (ResNet18 on CIFAR-10). The learning rate is separately tuned for each batch size. (a) Job scalability (and therefore resource utilization) depends on the batch size. (b) The most efficient batch size can depend on the allocated resources and the stage of training.

Even with an optimally-tuned learning rate, increasing the batch size results in faster-decreasing statistical efficiency [31, 43]. For every distinct allocation of GPUs, there is potentially a different batch size that best balances increasing system throughput with decreasing statistical efficiency, as illustrated in Fig. 1b. Furthermore, how quickly the statistical efficiency decreases with respect to the batch size depends on the current training progress. A job in a later stage of training can potentially tolerate 10× or larger batch sizes than earlier during training, without degrading statistical efficiency [31]. Thus, the best choice of batch size and learning rate depends on the resource allocation, which in turn depends on competition from other jobs sharing the cluster. In turn, the best choice of resource allocation depends on the chosen batch size. The batch size, learning rate, and therefore the best resource allocation, all depend on the current training progress of the job. Therefore, we argue that the choice of resource allocations, batch sizes, and learning rates are best made collectively and dynamically by a knowledgeable cluster scheduler.
This paper presents Pollux, a hybrid resource scheduler that co-adaptively allocates resources while tuning the batch size and learning rate for every DL job in a shared cluster.

• We provide a formulation of goodput for DL jobs, which is a performance metric that takes into account both system throughput and statistical efficiency. A model of goodput can be learned by observing the throughput and statistical behavior during training, and used for predicting the performance of DL jobs given different resource allocations and batch sizes.

• We design and implement a scheduling architecture that locally tunes the batch size and learning rate for each DL job, and globally optimizes cluster-wide resource allocations using a genetic algorithm. Both components actively cooperate with each other, and operate based on a common goal of goodput maximization.

• Pollux not only adapts to the changes in statistical efficiency over time, but can leverage it to reduce the cost of training large models. In cloud environments, Pollux can provision the right amount of resources at the right time, based on job training progress, to maximize statistical efficiency across the entire lifetime of a large DL job.

When compared with state-of-the-art DL schedulers using realistic DL job traces from Microsoft, we show that Pollux reduces the average job completion time by up to 70%. Even when all jobs are manually tuned beforehand, Pollux reduces the average job completion time by 25−50%. In cloud environments, Pollux reduces the cost of training ImageNet by 25%.

2 Background: Distributed DL Training

Training a deep learning model typically involves minimizing a loss function of the form

L(w) = (1/|X|) ∑_{x_i ∈ X} ℓ(w, x_i),   (1)

where w ∈ ℝ^d are the model parameters to be optimized, X is the training dataset, x_i are the individual samples in the training data, and ℓ is the loss evaluated at a single sample.

The loss function can be minimized using stochastic gradient descent (SGD), which repeatedly applies the following update until the loss converges to a stable value:

w^(t+1) = w^(t) − η ĝ^(t).   (2)

η is known as the learning rate, which is a scalar that controls the magnitude of each update, and ĝ^(t) is a stochastic gradient estimate of the loss function L, evaluated using a random mini-batch M^(t) ⊂ X of the training data, as follows:

ĝ^(t) = (1/|M^(t)|) ∑_{x_i ∈ M^(t)} ∇ℓ(w^(t), x_i).   (3)

The learning rate η and batch size m = |M^(t)| are training parameters which are typically chosen by the user.

2.1 System Throughput

The system throughput of DL training can be defined as the number of training samples processed per unit of wall-clock time. When a DL job is distributed across several nodes, its system throughput is determined by several factors, including (1) the allocation and placement of resources assigned to the job, (2) the method of distributed execution and synchronization, and (3) the batch size used by the SGD algorithm.

Data-parallel execution. Data-parallelism is a popular method of distributed execution for DL training. The model parameters w^(t) are replicated across a set of distributed GPUs 1, ..., K, and each mini-batch M^(t) is divided into equal-sized partitions per node, M_1^(t), ..., M_K^(t). Each GPU k computes a local gradient estimate ĝ_k^(t) using its own partition, as follows:

ĝ_k^(t) = (1/|M_k^(t)|) ∑_{x_i ∈ M_k^(t)} ∇ℓ(w^(t), x_i).   (4)

These local gradient estimates are then averaged across all replicas to obtain the desired ĝ^(t), as defined by Eqn. 3. Finally, each node applies the same update using ĝ^(t) to obtain the new model parameters w^(t+1), as defined by Eqn. 2.
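For concreteness, one data-parallel SGD iteration (Eqns. 2–4) can be sketched as follows. This is a minimal PyTorch-style illustration, assuming a torch.distributed process group has already been initialized across the K replicas; the function and variable names are ours and not part of Pollux.

```python
import torch
import torch.distributed as dist

def train_step(model, loss_fn, local_batch, optimizer):
    """One data-parallel SGD iteration, sketched from the viewpoint of one replica.

    Each of the K replicas computes a local gradient estimate on its own
    partition of the mini-batch; the estimates are then averaged with
    all-reduce so that every replica applies the same update.
    """
    inputs, targets = local_batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)   # mean loss over the local partition
    loss.backward()                          # local gradient estimate g_k^(t)

    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients across replicas, then divide to obtain the average g^(t).
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()                         # w^(t+1) = w^(t) - eta * g^(t)
    return loss.item()
```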
The run-time of each training iteration is determined by two main components. First, the time spent computing the local gradient estimates ĝ_k^(t), which we denote by T_grad. For neural networks, this computation is done using backpropagation [41]. Second, the time spent averaging the local gradients and synchronizing the model parameters across all job replicas, which we denote by T_sync. This synchronization is typically done using a collective all-reduce operation [36, 42], or a set of parameter servers [7, 8, 21]. T_sync can be influenced by the placement of replicas, and is typically smaller when the replicas are co-located within the same physical node or rack, rather than spread across different nodes or racks.

Limitations due to the batch size. The batch size used during training determines the upper limit on the system throughput. When the number of replicas is increased, each replica will process a smaller partition of the overall mini-batch, resulting in a proportionally smaller T_grad. On the other hand, T_sync is typically dependent on the size of the gradients and model parameters, rather than the batch size or number of replicas. Due to Amdahl's Law, no matter how many replicas are used, the run-time of each training iteration is lower bounded by T_sync.

To overcome this scalability limitation, it is desirable to increase the batch size, which allows a larger proportion of time to be spent computing the local gradient estimates as opposed to synchronizing gradients and parameters over the network. Thus, using a larger batch size enables higher system throughput when scaling to more data-parallel replicas.

2.2 Statistical Efficiency

The statistical efficiency of DL training can be defined as the amount of training progress made per unit of data processed. Parameters such as batch size and learning rate influence this quantity. For example, when the batch size is increased, the efficiency will decrease (by an amount that depends on how the learning rate is comparatively scaled). Here, we give background on previous work that expresses the statistical efficiency in terms of a quantity called the gradient noise scale. We then describe adaptive scaling rules that have been developed to set the learning rate with respect to the batch size, to achieve high statistical efficiency.

In Pollux, the gradient noise scale will be used to define a measure of statistical efficiency for a given choice of batch size, which will allow us to evaluate and optimize the goodput. Relatedly, adaptive scaling rules will be used to dynamically choose an appropriate learning rate with respect to the current batch size during this optimization.

Gradient Noise Scale. Previous work [25, 31] has aimed to characterize the statistical efficiency of DL training in terms of the gradient noise scale ϕ_t, which can be intuitively viewed as a measure of the signal-to-noise ratio of the gradient across training examples at iteration t [31]. A larger ϕ_t means that training parameters such as the batch size and learning rate can be increased to higher values with relatively less reduction of the statistical efficiency. The gradient noise scale can vary greatly between different DL models [13]. It is also non-constant and tends to gradually increase during training, by up to 10× or more [31]. Thus, it is possible to attain significantly better statistical efficiency for a large batch size later on in training. A knowledgeable scheduler like Pollux can adaptively scale training parameters for the right DL jobs at the right times, to better accommodate their model-dependent and time-varying levels of efficiency.

Learning Rate Scaling and AdaScale SGD. If the batch size is increased, the learning rate must also be scaled up to maintain a high statistical efficiency. One way to adjust the learning rate η with respect to the batch size m is by using simple scaling rules. For example, the linear scaling rule [30] prescribes that η be scaled proportionally with m, while the square-root scaling rule prescribes that η be scaled proportionally with √m. Correctly scaling the learning rate can improve statistical efficiency when training with large batch sizes, and result in orders-of-magnitude improvements to DL scalability and job completion time [15, 37, 47].

However, simple learning rate scaling rules cannot be used for predicting the statistical efficiency of a particular batch size and learning rate ahead of time. Thus, they are unable to provide the knowledge required by a cluster scheduler to jointly optimize resource allocations with batch sizes. Instead of scaling the learning rate by a constant factor for each batch size, AdaScale [25] scales the learning rate adaptively based on ϕ_t. Suppose a DL training job is run with batch size m_0. If the same job is run using a larger batch size m > m_0, then at iteration t, AdaScale scales the learning rate by the factor

r_t = (ϕ_t/m_0 + 1)/(ϕ_t/m + 1),   (5)

where we describe how ϕ_t is computed in Sec. 3.1. AdaScale has been shown to outperform the linear scaling rule for a wide variety of DL models and batch sizes.

For resource scheduling, the most important characteristic of AdaScale is its predictability. One iteration of AdaScale training with batch size m is approximately equivalent to r_t iterations of training with the original m_0. This property can be leveraged to measure statistical efficiency during training, and to predict its value before scaling to different batch sizes, as described in Sec. 3.1.
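For illustration, the AdaScale factor in Eqn. 5 is a one-line computation; the helper below is a minimal sketch (the function name and example values are ours, not the AdaScale library's API).

```python
def adascale_gain(phi_t: float, m0: int, m: int) -> float:
    """AdaScale learning-rate scaling factor r_t (Eqn. 5).

    phi_t: gradient noise scale measured at iteration t (relative to batch size m0)
    m0:    original batch size the learning rate eta_0 was tuned for
    m:     larger batch size actually being used (m >= m0)
    """
    return (phi_t / m0 + 1.0) / (phi_t / m + 1.0)

# Example: with a small noise scale the gain stays modest, while a large noise
# scale pushes it towards m / m0 (close to linear scaling).
print(adascale_gain(phi_t=100.0, m0=128, m=1024))    # ~1.62
print(adascale_gain(phi_t=10000.0, m0=128, m=1024))  # ~7.35
```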
2.3 Existing DL Schedulers

We broadly group existing DL schedulers into two categories. First, non-resource-adaptive schedulers are agnostic to the throughput scalability of DL jobs with respect to the amount of allocated resources. For example, Tiresias [16] does not require any knowledge about the throughput of each job if allocated different numbers of GPUs. Instead, users specify the number of GPUs at the time of job submission, which will be fixed for the lifetime of the job. Tiresias may preempt jobs to prevent head-of-line blocking, and co-locate job replicas for more efficient synchronization. Gandiva [46] is another DL scheduler which requires a user-specified number of GPUs, but is able to optimize resource utilization through fine-grained time sharing and job packing. Although Gandiva is able to dynamically grow or shrink the number of GPUs used by a job, it only does so opportunistically, and not based on knowledge of job scalability.

Second, only-resource-adaptive schedulers automatically decide the amount of resources allocated to each job based on how well they can be utilized to speed up the job. For example, Optimus [38] learns a predictive model for the system throughput of each job given various amounts of resources, and optimizes cluster-wide resource allocations to minimize the average job completion time. SLAQ [49], which was not evaluated on deep learning, uses a similar technique to minimize the average loss values for training general ML models.

All existing schedulers are agnostic to the statistical efficiency of DL training. The batch size and learning rate are left for the users to tune at the application level, which, as previously discussed, can be difficult to do efficiently.

3 The Goodput of DL Training

We present how goodput¹ can be defined, measured, and predicted for DL jobs, taking into account both system throughput and statistical efficiency. Goodput predictions are leveraged by Pollux to jointly optimize cluster-wide resource allocations and batch sizes.

¹ Our notion of goodput for DL is analogous to the traditional definition of goodput in computer networking, i.e. the fraction of useful throughput.

Definition 3.1. (Goodput) The goodput of a DL job at iteration t is the product between its system throughput and its statistical efficiency at iteration t.

GOODPUT_t(a, m) = THROUGHPUT(a, m) × EFFICIENCY_t(m)   (6)

a ∈ ℝ^N is an allocation vector, where a_n is the number of GPUs allocated from node n, and m is the batch size.

An initial batch size m_0 and learning rate η_0 are selected by the user when submitting their job. Pollux will start each job using a single GPU, m = m_0, and η = η_0. As the job runs, Pollux profiles its execution to learn and refine predictive models for both THROUGHPUT (Sec. 3.2) and EFFICIENCY (Sec. 3.1). Using these predictive models, Pollux periodically re-tunes a and m for each job, according to cluster-wide resource availability and performance (Sec. 4.2). The learning rate η is re-tuned using AdaScale (Sec. 2.2).

Goodput and statistical efficiency are measured relative to the initial batch size m_0 and learning rate η_0, and Pollux only considers batch sizes which are at least the initial batch size, i.e. m ≥ m_0. In this scenario, statistical efficiency is always between 0 and 1, and can be interpreted as a percentage relative to the initial batch size m_0. Therefore, goodput is always less than or equal to the throughput, being equal only if perfect statistical efficiency is achieved.

3.1 Modeling Statistical Efficiency

The statistical efficiency of a DL job using batch size m ≥ m_0 is the amount of progress made per training example using m, relative to using m_0. This quantity can be framed in terms of the gradient noise scale ϕ_t. Specifically, [31] formulates the amount of training progress that can be made in a single update using a batch of size m relative to the amount of progress made with a batch of size m_0. Similarly, AdaScale's [25] r_t (Sec. 2.2) is also the relative training progress made using batch size m versus m_0. In Appendix A, we show an equivalence between the formulations of efficiency given by both papers, and we can use these to write a concrete measure of statistical efficiency as

EFFICIENCY_t(m) = r_t m_0/m = (ϕ_t + m_0)/(ϕ_t + m).   (7)

Here, we can write the gradient noise scale as ϕ_t = m_0 σ_t²/µ_t², where σ_t² = Var[ĝ^(t)] is the variance and µ_t² = |E[ĝ^(t)]|² the squared norm of the gradient at iteration t using batch size m_0. Intuitively, Eqn. 7 measures the contribution from each training example to the overall progress. If EFFICIENCY_t(m) = E, then (1) 0 < E ≤ 1, and (2) training using batch size m will need to process 1/E times as many training examples to make the same progress as using batch size m_0.

During training, Pollux estimates the values of σ_t and µ_t to compute ϕ_t, then uses Eqn. 7 to predict the statistical efficiency at different batch sizes. The true values of σ_t and µ_t vary according to the training progress at iteration t, thus EFFICIENCY_t(m) reflects the lifetime-dependent trends exhibited by the true statistical efficiency.

Figure 2: Statistical efficiency for training ResNet-50 on ImageNet. (a) Efficiency vs. training progress using different batch sizes. (b) Actual efficiency vs. predicted efficiency using Eqn. 7. In Fig. 2a, each "statistical epoch" measures the same training progress across different batch sizes. In Fig. 2b, predictions are based on the value of ϕ_t measured using a batch size of 4000 images at epoch 15.

Fig. 2a shows an example of statistical efficiency measured for ImageNet training. First, a larger batch size results in lower statistical efficiency, but the gap narrows during the later phases of training. Second, how the statistical efficiency changes over time depends on details of the training procedure. In this case, the statistical efficiency increases dramatically when the learning rate is decayed by a factor of 10 at epochs 30 and 60, following the standard configuration for training ImageNet.

Fig. 2b shows the measured statistical efficiency at different batch sizes, compared with the statistical efficiency predicted by Eqn. 7 using ϕ_t measured at a fixed batch size. We find close agreement between the measured and predicted values across a range of different batch sizes, which indicates that EFFICIENCY_t(m) can be used by Pollux to predict the statistical efficiency at a different batch size without needing to train using that batch size ahead of time.
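The two factors of goodput can be written out directly. The sketch below is illustrative only: it assumes ϕ_t has already been estimated and that a throughput predictor (such as the model of Sec. 3.2) is available as a callable; all names are ours.

```python
def efficiency(phi_t: float, m0: int, m: int) -> float:
    """Statistical efficiency EFFICIENCY_t(m) relative to batch size m0 (Eqn. 7)."""
    return (phi_t + m0) / (phi_t + m)

def goodput(throughput_fn, alloc, m: int, phi_t: float, m0: int) -> float:
    """GOODPUT_t(a, m) = THROUGHPUT(a, m) * EFFICIENCY_t(m)  (Eqn. 6).

    throughput_fn: callable predicting examples/sec for an allocation and batch size
    alloc:         allocation vector a (GPUs allocated from each node)
    """
    return throughput_fn(alloc, m) * efficiency(phi_t, m0, m)

# A job at EFFICIENCY_t(m) = 0.8 must process 1/0.8 = 1.25x as many examples
# as it would at batch size m0 to make the same training progress.
```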
Estimating σ_t and µ_t. The standard way of estimating σ_t and µ_t involves calculating the sample variance of the local gradient estimates ĝ_k^(t) at each iteration [25, 31]. This can be done efficiently when there are multiple data-parallel processes, by using the different values of ĝ_k^(t) already available on each process. However, this method doesn't work when there is only a single process. In this particular situation, Pollux switches to a differenced variance estimator [45] which uses consecutive gradient estimates ĝ^(t−1) and ĝ^(t).
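As an illustration of the multi-replica case described above, the sketch below estimates ϕ_t = m_0 σ_t²/µ_t² from the per-replica gradients that data-parallel training already produces. It is a simplification of the estimators in [25, 31] (and omits the single-replica differenced estimator), and the names are not Pollux's actual API.

```python
import numpy as np

def gradient_noise_scale(local_grads, m0):
    """Estimate phi_t = m0 * sigma_t^2 / mu_t^2 from K per-replica gradients.

    local_grads: list of flattened gradient vectors g_k^(t), one per replica
    m0:          batch size that sigma_t and mu_t are measured relative to
    """
    grads = np.stack(local_grads)              # shape (K, d)
    mean_grad = grads.mean(axis=0)             # estimate of E[g^(t)]
    mu_sq = float(np.sum(mean_grad ** 2))      # |E[g^(t)]|^2
    # Sample variance across replicas, summed over parameter dimensions.
    sigma_sq = float(np.sum(grads.var(axis=0, ddof=1)))
    return m0 * sigma_sq / mu_sq

# With a single replica, the same quantities would instead be estimated from
# consecutive gradients g^(t-1) and g^(t) (the differenced variance estimator).
```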

3.2 Modeling System Throughput

To model and predict the system throughput for data-parallel DL, we aim to predict the time spent per training iteration, T_iter, given an allocation vector a and batch size m, and then calculate the throughput as

THROUGHPUT(a, m) = m / T_iter(a, m).   (8)

We start by separately modeling T_grad, the time in each iteration spent computing local gradient estimates, and T_sync, the time in each iteration spent averaging gradient estimates and synchronizing model parameters across all GPUs.

Modeling T_grad. The local gradient estimates are computed using back-propagation, whose run-time scales linearly with the local batch size in each process. Thus, we model T_grad as

T_grad(a, m) = α_grad + β_grad · m/K,   (9)

where m is the overall batch size, K = ∑_n a_n is the number of allocated GPUs, and α_grad and β_grad are learnable parameters.

Modeling T_sync. When K = 1, then no synchronization is needed and T_sync = 0. Otherwise, we model T_sync as a linear function of K. We choose a linear model because in strict data-parallelism, the amount of data sent and received from each replica is typically constant with respect to the size of the gradients and/or parameters. We include a linear factor to account for performance retrogressions associated with larger K, such as the increasing likelihood of stragglers or network delays.

Since synchronization requires network communication between replicas, co-location of those replicas on the same node can significantly improve T_sync. Thus, we use different parameters depending on the placement:

T_sync(a, m) = 0                                        if K = 1,
               α_sync^local + β_sync^local · (K − 2)    if N = 1, K ≥ 2,
               α_sync^node + β_sync^node · (K − 2)      otherwise,   (10)

where N is the number of physical nodes occupied by at least one replica. α_sync^local and β_sync^local are the constant and retrogression parameters for when all processes are co-located onto the same node. α_sync^node and β_sync^node are the analogous parameters for when at least two processes are located on different nodes. Note that our model for T_sync can be extended to account for rack-level locality by adding a third pair of parameters.

Combining T_grad and T_sync. Modern DL frameworks can partially overlap T_grad and T_sync by overlapping gradient computation with network communication [48]. The degree of this overlap depends on structures in the specific DL model being trained, like the ordering and sizes of its layers. Assuming no overlap, then T_iter = T_grad + T_sync. Assuming perfect overlap, then T_iter = max(T_grad, T_sync). A realistic value of T_iter is somewhere in between these two extremes. To capture the overlap between T_grad and T_sync, we model T_iter as

T_iter(a, m) = (T_grad(a, m)^γ + T_sync(a)^γ)^(1/γ),   (11)

where γ ≥ 1 is a learnable parameter. Eqn. 11 has the property that T_iter = T_grad + T_sync when γ = 1, and smoothly transitions towards T_iter = max(T_grad, T_sync) as γ → ∞.

Figure 3: Our throughput model (Eqn. 8) fit to measured throughput values for ImageNet training. (a) Throughput vs. nodes. (b) Throughput vs. batch size.

Fig. 3 shows an example of our THROUGHPUT function fit to measured throughput values for a range of resource allocations and batch sizes. We find that our model can represent the observed data closely, while varying both the amount of resources as well as the batch size. In Sec. 4.1, we describe how we fit our throughput model to values measured during training, and then use it to predict training throughput for different resource allocations and batch sizes.
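The complete throughput model (Eqns. 8–11) is compact enough to state in code. The sketch below is illustrative: it keeps the seven parameters of the model in a plain dictionary with hypothetical key names.

```python
def t_grad(params, m, K):
    """Per-iteration gradient computation time (Eqn. 9)."""
    return params["alpha_grad"] + params["beta_grad"] * m / K

def t_sync(params, K, N):
    """Per-iteration synchronization time (Eqn. 10)."""
    if K == 1:
        return 0.0
    if N == 1:
        return params["alpha_sync_local"] + params["beta_sync_local"] * (K - 2)
    return params["alpha_sync_node"] + params["beta_sync_node"] * (K - 2)

def t_iter(params, m, K, N):
    """Iteration time, interpolating between no overlap and perfect overlap (Eqn. 11)."""
    gamma = params["gamma"]   # gamma >= 1
    return (t_grad(params, m, K) ** gamma + t_sync(params, K, N) ** gamma) ** (1.0 / gamma)

def throughput(params, alloc, m):
    """THROUGHPUT(a, m) = m / T_iter(a, m)  (Eqn. 8); alloc is GPUs used per node."""
    K = sum(alloc)                          # total allocated GPUs
    N = sum(1 for a in alloc if a > 0)      # nodes with at least one replica
    return m / t_iter(params, m, K, N)
```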
4 Pollux Design and Architecture

Compared with existing DL schedulers, Pollux performs adaptation at two distinct granularities. First, at a job-level granularity, Pollux dynamically tunes the batch size and learning rate for best utilization of the allocated resources. Second, at the cluster-wide granularity, Pollux dynamically re-allocates resources, driven by the goodput of all jobs sharing the cluster. To achieve this co-adaptivity in a scalable way, Pollux's design consists of two primary components.

First, a PolluxAgent runs together with each job. It measures the gradient noise scale and system throughput for that job, and tunes its batch size and learning rate for efficient utilization of its current allocated resources. PolluxAgent periodically reports the goodput function of its job to PolluxSched.

Second, PolluxSched periodically optimizes the resource allocations for all jobs in the cluster, taking into account the current statistical efficiency for each job. It uses the goodput function to predict a job's training performance when allocated different resources.

PolluxAgent and PolluxSched co-adapt to each other. While PolluxAgent adapts each training job to make efficient use of its allocated resources, PolluxSched dynamically re-allocates each job's resources, taking into account the PolluxAgent's ability to tune its job.

Figure 4: Architecture of Pollux (Fig. 4c), compared with existing schedulers which are either non-resource-adaptive (Fig. 4a) or only-resource-adaptive (Fig. 4b).

Fig. 4 illustrates Pollux's co-adaptive architecture (Fig. 4c), compared with existing schedulers which are either non-adaptive (Fig. 4a), or resource-adaptive without being involved with statistical efficiency (Fig. 4b).

4.1 PolluxAgent: Job-level Optimization

An instance of PolluxAgent is started with each training job. During training, it continually measures the job's gradient noise scale and system throughput, and reports them to PolluxSched at a fixed interval. It also uses this information to determine the most efficient batch size for its job given its current resource allocations, and adapts its job's learning rate to this batch size using AdaScale.

Online model fitting. In Sec. 3.2, we defined the system throughput parameters of a training job as the 7-tuple

θ_sys = (α_grad, β_grad, α_sync^local, β_sync^local, α_sync^node, β_sync^node, γ),   (12)

which are required to construct the THROUGHPUT function. Together with the gradient noise scale ϕ_t and initial batch size m_0, the triple (θ_sys, ϕ_t, m_0) specifies the GOODPUT function. While m_0 is a constant configuration provided by the user, and ϕ_t can be computed according to Sec. 3.1, θ_sys is estimated by fitting the THROUGHPUT function to observed throughput values collected about the job during training.

PolluxAgent measures the time taken per iteration, T_iter, and records the triple (a, m, T_iter) for all combinations of resource allocations a and batch size m encountered during its lifetime. Periodically, PolluxAgent fits the parameters θ_sys to all of the throughput data collected so far. Specifically, we minimize the root mean squared logarithmic error (RMSLE) between Eqn. 11 and the collected data triples, using L-BFGS-B [50]. We set constraints for each α and β parameter to be non-negative, and γ to be in the range [1, 10]. PolluxAgent then reports the updated values of θ_sys and ϕ_t to PolluxSched.
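A sketch of how such a fit could be implemented with SciPy's L-BFGS-B optimizer, assuming a T_iter predictor with the same signature as the sketch following Sec. 3.2; the parameter ordering, initial guess, and helper names are illustrative rather than Pollux's exact code.

```python
import numpy as np
from scipy.optimize import minimize

PARAM_NAMES = ["alpha_grad", "beta_grad", "alpha_sync_local", "beta_sync_local",
               "alpha_sync_node", "beta_sync_node", "gamma"]

def fit_theta_sys(observations, t_iter_fn):
    """Fit the 7 throughput parameters (Eqn. 12) to profiled iteration times.

    observations: list of (K, N, m, measured_T_iter) tuples
    t_iter_fn:    model predicting T_iter from a params dict, m, K, N (Eqn. 11)
    """
    def rmsle(x):
        params = dict(zip(PARAM_NAMES, x))
        errors = [(np.log1p(t_iter_fn(params, m, K, N)) - np.log1p(t)) ** 2
                  for (K, N, m, t) in observations]
        return float(np.sqrt(np.mean(errors)))

    x0 = np.array([0.1, 0.01, 0.0, 0.0, 0.0, 0.0, 1.0])   # rough initial guess
    bounds = [(0, None)] * 6 + [(1, 10)]                   # alpha, beta >= 0; gamma in [1, 10]
    result = minimize(rmsle, x0, method="L-BFGS-B", bounds=bounds)
    return dict(zip(PARAM_NAMES, result.x))
```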
Prior-driven exploration. At the beginning of each job, throughput values have not yet been collected for many different resource allocations. To ensure that Pollux finds efficient resource allocations through systematic exploration, we impose several priors which bias θ_sys towards the belief that throughput scales perfectly with more resources, until such resource configurations are explored.

In particular, we set α_sync^local = 0 while the job had not used more than one GPU, α_sync^node = β_sync^node = 0 while the job had not used more than one node, and β_sync^local = β_sync^node = 0 while the job had not used more than two GPUs. This creates the following behavior: each job starts with a single GPU, and is initially assumed to scale perfectly to more GPUs. PolluxSched is then encouraged to allocate more GPUs and/or nodes to the job, naturally as part of its resource optimization (Sec. 4.2), until the PolluxAgent can estimate θ_sys more accurately. Finally, to prevent a job from being immediately scaled out to arbitrarily many GPUs, we restrict the maximum number of GPUs which can be allocated to at most twice the maximum number of GPUs the job has been allocated in its lifetime.

Training job tuning. With θ_sys, ϕ_t, and m_0, which fully specify the DL job's GOODPUT function at its current training progress, PolluxAgent determines the most efficient batch size,

m* = argmax_m GOODPUT(a, m),   (13)

where a is the job's current resource allocation. To perform this maximization efficiently, we observe that GOODPUT(a, m) is a unimodal function of m, and use golden-section search [27] to find its maximum.

Once a new batch size is found, the job will use it for its subsequent training iterations, using AdaScale to adapt its learning rate appropriately. As the job's statistical efficiency changes over time, PolluxAgent will periodically re-evaluate the most efficient batch size and learning rate.
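Because GOODPUT(a, m) is treated as unimodal in m, the maximization in Eqn. 13 can be carried out with a plain golden-section search. The sketch below is a generic maximizer; the commented usage line shows how it might be combined with a goodput callable, and all names are assumptions.

```python
def golden_section_max(f, lo: float, hi: float, tol: float = 1.0):
    """Maximize a unimodal function f over [lo, hi] by golden-section search."""
    inv_phi = (5 ** 0.5 - 1) / 2            # ~0.618, the inverse golden ratio
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) >= f(d):
            b, d = d, c                      # maximum lies in [a, d]
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                      # maximum lies in [c, b]
            d = a + inv_phi * (b - a)
    return (a + b) / 2

# Hypothetical usage: search batch sizes between m0 and an upper limit.
# best_m = int(golden_section_max(lambda m: goodput(throughput, alloc, int(m), phi_t, m0),
#                                 lo=m0, hi=max_batch_size))
```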
4.2 PolluxSched: Cluster-wide Optimization

PolluxSched periodically allocates (or re-allocates) resources for every job in the cluster. To determine a set of efficient cluster-wide resource allocations, it uses a genetic algorithm to maximize a fitness function which is defined as a weighted mean across speedups for each job:

FITNESS(A) = (∑_j w_j · SPEEDUP_j(A_j)) / (∑_j w_j).   (14)

A is an allocation matrix with each row A_j being the placement vector for a job j, thus A_jn is the number of GPUs on node n allocated to job j. Intuitively, maximizing FITNESS causes PolluxSched to allocate more resources to jobs that achieve a high SPEEDUP when provided with many GPUs (i.e. jobs that scale well). We define the speedup of each job as the factor of goodput improvement using the given placement vector over using a single process with the optimal batch size, i.e.

SPEEDUP_j(A_j) = max_m GOODPUT_j(A_j, m) / max_m GOODPUT_j(1, m),   (15)

where GOODPUT_j is the goodput of job j at its current training iteration. Our definition of SPEEDUP_j has the property that allocating a single GPU always results in a speedup of 1, and scales sub-linearly with increasing numbers of allocated GPUs. Similar to PolluxAgent, PolluxSched performs the maximizations in the numerator and denominator using golden-section search [27].
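For illustration, scoring one candidate allocation matrix (Eqns. 14–15, including the restart penalty described below) might look like the following sketch; `max_goodput_fn`, the job objects, and their attributes are hypothetical stand-ins for the per-job goodput models reported by each PolluxAgent.

```python
def speedup(job, placement, max_goodput_fn):
    """SPEEDUP_j(A_j): goodput with this placement vs. one well-tuned GPU (Eqn. 15)."""
    single_gpu = [1] + [0] * (len(placement) - 1)
    return max_goodput_fn(job, placement) / max_goodput_fn(job, single_gpu)

def fitness(alloc_matrix, jobs, weights, max_goodput_fn, restart_penalty=0.25):
    """Weighted mean of per-job speedups for a candidate allocation matrix (Eqn. 14)."""
    total, weight_sum = 0.0, 0.0
    for job, placement, w in zip(jobs, alloc_matrix, weights):
        s = speedup(job, placement, max_goodput_fn)
        if placement != job.current_placement:   # re-allocation forces a checkpoint-restart
            s -= restart_penalty
        total += w * s
        weight_sum += w
    return total / weight_sum
```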
Job weights. Although PolluxSched can work well with all weights w_j set to 1, we provide the ability to re-weight jobs as an additional tuning knob for cluster operators. Since large jobs may take up a significant fraction of cluster resources for extended amounts of time, smaller jobs may be left with fewer resources available. Depending on the preference between large jobs versus small jobs, a cluster operator can address this issue by decreasing the weight of jobs according to the amount of GPU-time already spent on them. In particular, we define the weight w_j of job j as

w_j = min(1, (GPUTIME_THRES / GPUTIME(j))^λ).   (16)

The weight of a job is 1 if its current total GPU-time is at most GPUTIME_THRES, and decays gradually thereafter. The parameter λ controls the rate of decay, with λ = 0 being no decay, and larger values being faster decay. We further study the effect of job weights in Sec. 5.3.2.

4.2.1 Genetic Algorithm

Our genetic algorithm operates on a population of distinct allocation matrices (see Fig. 5). During each generation of the algorithm, existing allocation matrices are first randomly mutated, then crossed over to produce offspring allocation matrices, and finally modified to satisfy node resource constraints. A constant population size is maintained after each generation by discarding the allocation matrices with the lowest objective values (according to Eqn. 14). After several generations of the genetic algorithm, the allocation matrix with the highest fitness score is applied to the jobs running in the cluster.

We describe the genetic algorithm operations in detail below, which are illustrated in Fig. 5.

Figure 5: The mutation, crossover, and repair operations performed by PolluxSched during each generation of its genetic algorithm.

Mutation. Each element A_jn is mutated with probability 1/N, where N is the total number of nodes. Thus, each job suffers on average one mutation in each generation. When A_jn is mutated, it is set to a random integer between 0 and the total number of GPUs on node n.

Crossover. When two allocation matrices are crossed over, their rows are randomly mixed. In other words, the offspring allocation matrix consists of job allocations which are randomly selected between its two parent allocation matrices. In each generation, the allocation matrices which participate in crossover are picked using tournament selection [32].

Repair. After the mutation and crossover operations, the resultant allocation matrices may no longer satisfy resource constraints, and may request more GPUs than are available on a node. To address this issue, random elements are decremented within columns of the allocation matrix that correspond to over-capacity nodes, until the GPU resource constraints are satisfied.

Penalty. Each time a job is re-allocated to a different set of GPUs, it will need to save a checkpoint of its current model parameters, and restart using its new allocation of GPUs. This process can typically take 30s–60s depending on the size of the model being trained. To prevent an excessive number of re-allocations, when PolluxSched evaluates the fitness function for a given allocation matrix, it applies a penalty for every job that needs to restart, i.e.

SPEEDUP_j(A_j) ← SPEEDUP_j(A_j) − RESTART_PENALTY.

Interference avoidance. When multiple distributed DL jobs share a single node, their network usage while synchronizing gradients and model parameters may interfere with each other, causing both jobs to slow down [24]; Xiao et al. [46] report up to 50% slowdown for DL jobs which compete with each other for network resources. PolluxSched mitigates this issue by disallowing different distributed jobs (each using GPUs across multiple nodes) from sharing the same node.

This non-interference constraint is implemented as part of the repair step of the genetic algorithm (Sec. 4.2.1), by also removing distributed jobs from shared nodes until at most one distributed job is allocated to each node. We study the effects of interference avoidance in Sec. 5.3.2.
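The genetic operators described above can be expressed directly on plain integer matrices, as in the following sketch. Pollux's actual implementation is built on pymoo (Sec. 4.3), so this is only an illustration of the mutation, crossover, and repair steps.

```python
import random

def mutate(A, gpus_per_node):
    """Mutate each entry A[j][n] with probability 1/N (N = number of nodes)."""
    N = len(gpus_per_node)
    for row in A:
        for n in range(N):
            if random.random() < 1.0 / N:
                row[n] = random.randint(0, gpus_per_node[n])

def crossover(A, B):
    """Offspring takes each job's placement row from one of the two parents."""
    return [random.choice([a_row, b_row])[:] for a_row, b_row in zip(A, B)]

def repair(A, gpus_per_node):
    """Decrement random entries in over-capacity columns until constraints hold."""
    N = len(gpus_per_node)
    for n in range(N):
        while sum(row[n] for row in A) > gpus_per_node[n]:
            row = random.choice([r for r in A if r[n] > 0])
            row[n] -= 1
```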

4.2.2 Cloud Auto-scaling

In cloud computing environments, GPU nodes can be dynamically provisioned during periods of high demand to decrease job completion time, as well as released during periods of low demand to decrease the cost of cluster resources. Beyond adapting to the number and sizes of DL jobs, co-adaptive scheduling presents a unique opportunity for cluster auto-scaling.

Since many distributed DL jobs tend to increase in statistical efficiency as training progresses, it may be more cost-effective to provision more cloud resources during the later iterations of a large training job, rather than earlier on.
In order to decide when to request or release nodes in the cloud, we first define a measure of cluster resource utility for a given allocation matrix as

UTILITY(A) = ∑_j SPEEDUP_j(A_j) / TOTAL_GPUS.   (17)

Since each SPEEDUP_j is at most equal to the number of GPUs allocated to job j, then ∑_j SPEEDUP_j(A_j) ≤ TOTAL_GPUS, and thus 0 ≤ UTILITY(A) ≤ 1.

A cluster operator defines a LOW_UTIL_THRES and a HIGH_UTIL_THRES. PolluxSched will request additional nodes (when UTILITY(A) is high) or release existing nodes (when UTILITY(A) is low) until UTILITY(A) falls within this range, where A is the current allocation applied in the cluster. The cluster operator also defines a MIN_NODES and MAX_NODES, which limit the size of the cluster.

PolluxSched performs a binary search for the desired number of nodes, with the assumption that UTILITY decreases with increasing numbers of nodes. At each step of the binary search, PolluxSched runs its genetic algorithm to evaluate the UTILITY of the cluster size being queried. In the end, the cluster size with a UTILITY value closest to (LOW_UTIL_THRES + HIGH_UTIL_THRES)/2 is selected.
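The auto-scaling rule above can be summarized in a short sketch. The threshold defaults and helper names below are ours, and `evaluate_utility` stands in for re-running the genetic algorithm at a candidate cluster size.

```python
def utility(speedups, total_gpus):
    """UTILITY(A) = sum of per-job speedups / total GPUs (Eqn. 17); always in [0, 1]."""
    return sum(speedups) / total_gpus

def desired_cluster_size(evaluate_utility, min_nodes, max_nodes,
                         low_thres=0.45, high_thres=0.75):
    """Binary search for the node count whose utility is closest to the target band."""
    target = (low_thres + high_thres) / 2
    lo, hi = min_nodes, max_nodes
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        u = evaluate_utility(mid)            # runs the genetic algorithm for `mid` nodes
        if abs(u - target) < abs(evaluate_utility(best) - target):
            best = mid
        if u > target:
            lo = mid + 1                     # utility assumed to decrease with more nodes
        else:
            hi = mid - 1
    return best
```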
4.3 Implementation

PolluxAgent is implemented as a Python library which is imported into DL training code. We integrated PolluxAgent with PyTorch [36], which uses all-reduce as its gradient synchronization algorithm. PolluxAgent inserts performance profiling code which measures the time taken for each iteration of training, as well as calculating the gradient noise scale. At a fixed time interval, PolluxAgent fits the system throughput model (Eqn. 11) to the profiled metrics collected so far, and reports the fitted system throughput parameters, along with the latest gradient statistics, to PolluxSched. After reporting to PolluxSched, PolluxAgent updates the job's batch size, by optimizing its now up-to-date goodput function (Eqn. 6) with its currently allocated resources.

PolluxSched is implemented as a service in Kubernetes [2]. At a fixed time interval, PolluxSched runs a fixed number of generations of its genetic algorithm. It then applies the best allocation matrix to the cluster, by creating and terminating Kubernetes Pods which run the job replicas. Although only the allocation matrix with the highest fitness score is applied to the cluster, the entire population is saved and used to bootstrap the genetic algorithm in the next scheduling interval. The genetic algorithm is implemented using pymoo [6].

5 Evaluation

The primary value of Pollux is automatically choosing the right resource allocations, and adapting the batch size and learning rate of each job to best utilize those resources. In comparison, existing schedulers rely on users to specify these training parameters with well-configured values, which is unrealistic and often requires expert knowledge about the DL model structure and statistical behavior during training.

However, the ability of users to configure their jobs for efficient training is difficult to quantify. Thus, we compare Pollux with state-of-the-art DL schedulers from two perspectives. First, in Sec. 5.2, we construct a workload with ideally configured jobs, and show that Pollux still outperforms the baseline DL schedulers in this scenario. In Sec. 5.3.1, we show how the performance of Pollux and the baseline DL schedulers changes when realistically configured jobs are added to the workload.

5.1 Methodology

Unless stated otherwise, we configured PolluxSched to use a 60s scheduling interval. During each scheduling interval, we run the genetic algorithm for 100 generations using a population of 100 allocation matrices. We set GPUTIME_THRES to 4 GPU-hours with λ = 0.5, and RESTART_PENALTY to 0.25. We configured PolluxAgent to report up-to-date system throughput parameters and gradient statistics every 30s.

Workload. We constructed our synthetic workload by sampling jobs from the deep learning cluster traces published by Microsoft [24]. Each job has information on its submission time, number of GPUs, and duration. However, no information is provided on the model architectures being trained or dataset characteristics. Instead, our synthetic workload consists of the models and datasets described in Table 1.

We categorized each job in the trace and in Table 1 based on their total GPU-time, calculated as the product between their duration and number of GPUs, into four categories — Small (0 to 1 GPU-hours), Medium (1 to 10 GPU-hours), Large (10 to 100 GPU-hours), and XLarge (100 to 1000 GPU-hours).
Task                    | Dataset          | Model(s)        | Validation Metric   | Category | Frac. of Workload
Image Classification    | ImageNet [9]     | ResNet-50 [19]  | 75% top-1 accuracy  | X-Large  | 2%
Object Detection        | PASCAL-VOC [11]  | YOLOv3 [40]     | 82% mAP score       | Large    | 5%
Speech Recognition      | CMU-ARCTIC [28]  | DeepSpeech2 [3] | 25% word error      | Medium   | 17%
Image Classification    | Cifar10 [29]     | ResNet18 [19]   | 94% top-1 accuracy  | Small    | 38%
Collaborative Filtering | MovieLens [18]   | NeuMF [20]      | 71.5% hit rate      | Small    | 38%

Table 1: Models and datasets used in our evaluation workload. Each model was trained until the provided validation metric. The fraction of jobs from each category are chosen according to the public Microsoft cluster traces.

We then picked a model and dataset from Table 1 which is in the same category as the corresponding job in the trace.

The primary workload used in our evaluation consists of 160 job submissions randomly sampled from an 8-hour period which includes the peak rate of job submissions during an average 24-hour day. Job submissions peak during the fourth hour, at 3× the rate of the first hour (Fig. 6).

Figure 6: Number of job submissions during each hour of the day in the Microsoft trace. Our primary synthetic workload is sampled from the interval between the dashed lines.

Testbed. We conduct experiments using a cluster consisting of 16 nodes and 64 GPUs. Each node is an AWS EC2 g4dn.12xlarge instance with 4 Tesla T4 GPUs, 48 vCPUs, 192GB memory, and a 900GB SSD. All instances are launched within the same placement group. We deployed Kubernetes 1.18.2 on this cluster, along with CephFS 14.2.8 to store checkpoints for checkpoint-restart elasticity.

Simulator. We built a discrete-time cluster simulator to study the effects of the techniques proposed in Pollux. Our simulator reproduces the system throughput of the jobs in our workload, using different batch sizes, numbers of GPUs, and different placements of GPUs across nodes. It also reproduces the gradient noise scale across the lifetime of each job, enabling statistical efficiency to be calculated using the simulator. More details of our simulator can be found in Sec. 5.3.

5.2 Testbed Experiments

We compare Pollux to idealized versions of two state-of-the-art deep learning schedulers, Tiresias [16] and Optimus [38], as described in Sec. 2.3. Tiresias is non-resource-adaptive, while Optimus is only-resource-adaptive. Whereas Pollux dynamically adapts the number of GPUs and batch sizes of DL jobs, Optimus only adapts the number of GPUs, and Tiresias adapts neither. To establish a fair baseline for comparison, we tune the learning rate using AdaScale for all three schedulers.

Tiresias+TunedJobs. Tiresias requires both the number of GPUs and the batch size to be set by the user at the time of job submission. We manually tuned the number of GPUs and batch sizes for each job in our synthetic workload, as follows. We measured the time per training iteration for each model in Table 1 using a range of GPU allocations and batch sizes, and fully trained each model using a range of different batch sizes (see Sec. 5.3 for details). We considered a number of GPUs valid if using the optimal batch size for that number of GPUs achieves 50%–80% of the ideal speedup versus using the optimal batch size on a single GPU, where the ideal speedup is defined to be equal to the number of GPUs. For each job in our workload, we then selected the number of GPUs and batch size randomly from its set of valid configurations.

Our job configurations essentially assume that the users are highly rational and knowledgeable about the scalability of the models they are training. A speedup of less than 50% of the ideal would lead to under-utilization of resources, and a speedup of more than 80% means the job can still be further parallelized efficiently. We emphasize that this assumption of uniformly sophisticated users is unrealistic in practice and only serves for evaluating the performance of Tiresias in an ideal world.

Optimus+Oracle. For Optimus, we use the same batch sizes for each job as for Tiresias. However, since Optimus automatically decides resource allocations, the number of GPUs set for each job is ignored. If a job's batch size does not fit on a single GPU, we additionally enforce a minimum number of GPUs, such that the total batch size will fit on that many GPUs.

Like Pollux, Optimus leverages a performance model to predict the throughput of each job given different amounts of cluster resources. However, its performance model is specific to the parameter server architecture for synchronizing gradients. To account for this difference, our implementation of Optimus uses our own throughput model as described in Sec. 3.2.

Furthermore, Optimus leverages a prediction of the number of training iterations until convergence, by fitting a simple function to the model's convergence curve. However, training realistic models to an acceptable quality, including the ones in our workload, requires learning rate decay schedules that make the convergence curve non-smooth and difficult to extrapolate from. To eliminate any mispredictions, and for the purposes of our evaluation, we run each job ahead of time, and provide Optimus with the exact number of iterations until completion.
Policy             | Average JCT | 99%tile JCT | Makespan
Pollux             | 1.2h        | 8.8h        | 20h
Optimus+Oracle     | 1.6h        | 11h         | 24h
Tiresias+TunedJobs | 2.4h        | 16h         | 33h

Table 2: Summary of testbed experiments. Even in the unrealistic scenario where each job is ideally-tuned before being submitted, Pollux still outperforms baseline DL schedulers.

5.2.1 Results

Even when every job is pre-configured with an ideal number of GPUs and batch size, Pollux outperforms both Optimus+Oracle and Tiresias. Table 2 shows the detailed results. Pollux achieves 25% lower average JCT and 17% shorter makespan than Optimus+Oracle, and 50% lower average JCT and 39% shorter makespan than Tiresias. We observe similar differences for tail JCTs—Pollux achieves a 20% and 45% lower 99th percentile JCT than Optimus+Oracle and Tiresias+TunedJobs, respectively.

On average, we measured that Pollux maintained ≈91% statistical efficiency across all jobs running in the cluster at any given time, while Optimus+Oracle and Tiresias could only maintain ≈74%. Since the jobs in our workload already use the most efficient fixed batch size, this result indicates that Pollux is able to adaptively tune the batch size to achieve better statistical efficiency.

Furthermore, we find that over the lifetime of each job, Pollux achieves on average 1.2× and 1.5× higher throughput than Optimus+Oracle and Tiresias+TunedJobs, respectively. This is due to the fact that Pollux may increase the batch size to take advantage of more GPUs when allocated. In terms of goodput, the improvement for Pollux is wider, being on average 1.4× and 2× higher than Optimus+Oracle and Tiresias+TunedJobs, respectively.

5.3 Simulator Experiments

We use a cluster simulator in order to evaluate a broader set of workloads and settings. Our simulator is constructed by measuring the performance and gradient statistics of each model in Table 1, under many different resource and batch size configurations, and re-playing them for each simulated job. This way, we are able to simulate both the system throughput and statistical efficiency of the jobs in our workload.

Unless stated otherwise, each experiment in this section is repeated 8 times on different traces generated using the same duration, number of jobs, and job size distributions as in Sec. 5.2, and we report the average results across all 8 traces.

Simulating system throughput. We measured the time per training iteration for all allocations of GPUs in a 4-node cluster with 4 GPUs each, removing symmetric placements. For each allocation, we used a range of batch sizes, spaced geometrically by factors of ≈√2, up to the largest batch size which fit into the total GPU memory. We did the same for all valid combinations of 4 to 16 nodes and 1 to 4 GPUs per node, where the same number of GPUs is allocated from each node. Finally, to simulate the throughput for a job, we perform a multi-dimensional linear interpolation between the nearest configurations we measured.

Simulating statistical efficiency. We measured the gradient noise scale during training using a range of batch sizes, spaced geometrically by factors of ≈√2. To simulate the statistical efficiency for a job using a certain batch size, we linearly interpolated its value of the gradient noise scale between the two nearest batch sizes we measured.

Simulator fidelity. The data we collected about each job enables our simulator to reproduce several system effects, including the performance impact of different GPU placements. We also simulate the overhead of checkpoint-restarts by injecting a 30-second delay for each job which has its resources re-allocated. Unless stated otherwise, we do not simulate any network interference between different jobs. Therefore, each distributed job runs at its full speed, ignoring other distributed jobs. We study the effects of interference in more detail in Sec. 5.3.2.

Compared with our testbed experiments in Sec. 5.2, we find that our simulator is able to reproduce similar factors of improvement. Our simulated results show that Pollux reduces the average JCT by 26% and 40% over Optimus+Oracle and Tiresias+TunedJobs, respectively, compared with 25% and 50% in our testbed experiments.

5.3.1 Workloads with Realistic Jobs

In Sec. 5.2, we compared Pollux with state-of-the-art DL schedulers in the scenario that every submitted job is already configured with an ideal number of GPUs and batch size. Without assistance from a system like Pollux, users likely need to try many different configurations for each job, before finding one that is reasonable. Each trial run by the user is yet another job configured with a potentially poor choice of GPUs and batch size.

In this section, we compare Pollux with Optimus+Oracle and Tiresias using realistic job configurations. For each job, we use the number of GPUs as specified in the Microsoft traces, which were decided by real users of a DL cluster. We then select a random batch size which is within a factor of 2 from the most efficient batch size for that number of GPUs, according to our simulator data. We constructed several workload traces using various mixtures of ideally-tuned jobs (Sec. 5.2) and user-configured jobs from the Microsoft cluster trace.

Fig. 7 shows the results. First, the performance of Pollux is unaffected by the inclusion of the user-configured jobs. This is simply because both the number of GPUs and the batch size are decided automatically by Pollux, and co-adapted during training, rather than being set by users at job submission time. On the other hand, the performance of Tiresias degrades quickly as more user-configured jobs are included in the workload. We find that many users requested a small number
of GPUs, when they could still have efficiently utilized more GPUs—especially in the later stages of each job, when the statistical efficiency of using larger batch sizes is high.

The performance of Optimus+Oracle also degraded, but not by as much as Tiresias. We found that even though Optimus+Oracle is able to allocate more GPUs to each job, the fact that it does not also adapt the batch size means those extra GPUs are under-utilized in terms of system throughput. Overall, when all job configurations are derived from the Microsoft cluster traces (corresponding to the 100% group in Fig. 7), Pollux reduces the average JCT by 50% relative to Optimus+Oracle, and by 70% relative to Tiresias.

Figure 7: Average JCT (relative to Pollux) for workloads with increasing ratios of realistic, user-configured jobs. 100% corresponds to jobs exactly as configured in the Microsoft trace. 0% corresponds to the trace used in Sec. 5.2.

5.3.2 Other Effects on Scheduling

Sensitivity to load. We compare the performance of Pollux with Optimus+Oracle and Tiresias+TunedJobs for increasing load, in terms of the rate of job submissions. Fig. 8 shows the results. As expected, all three scheduling policies suffer longer job completion times as the load is increased. However, the advantage of Pollux is even more noticeable at higher loads. At 2× the load, the average job completion time of Pollux increased by 1.8×, while that of Optimus+Oracle and Tiresias+TunedJobs increased by 2.0× and 2.6×, respectively.

Figure 8: Average JCT for various numbers of jobs submitted within an 8-hour period.

Impact of job weights. We investigate the impact of job weights (Eqn. 16) by comparing different values for the decay parameter λ. In particular, λ = 0 represents our baseline, i.e. when each job is given a weight of 1. Table 3 summarizes the results. We find that increasing λ significantly improves the 50th percentile JCT, while moderately degrading the 99th percentile JCT. These results show that job weighting effectively prioritizes smaller jobs to finish quickly ahead of larger jobs. On the other hand, we find only a small impact on the average JCT, which improves by 5% at λ = 0.5.

Decay (λ) | Avg. JCT | 50%tile JCT | 99%tile JCT
0.0       | 1        | 1           | 1
0.5       | 0.95     | 0.77        | 1.05
1.0       | 0.98     | 0.68        | 1.20

Table 3: Job completion time for various degrees of job weight decay λ. Results are shown relative to λ = 0, which always assigns a weight of 1 to every job.

Impact of interference avoidance. PolluxSched attempts to avoid interference between DL jobs by disallowing cluster allocations which include two distributed jobs that share the same node. Although this strategy guarantees that network resources on each node are reserved for at most one job, it also constrains the set of feasible cluster allocations. The genetic algorithm may have a more difficult time finding an efficient cluster allocation. On the other hand, Xiao et al. [46] report up to 50% slowdown for DL jobs which compete with each other for network resources.

To evaluate the impact of PolluxSched's interference avoidance constraint, we artificially inject various degrees of slowdown for distributed jobs sharing the same node. Fig. 9 shows the results. With interference avoidance enabled, job completion time is unaffected by more severe slowdowns, because network contention is completely mitigated. However, without interference avoidance, job completion times increase significantly, up to 1.4× when the interference slowdown is 50%. On the other hand, in the ideal scenario when there is zero slowdown due to interference, PolluxSched with avoidance disabled only performs 2% better. This result indicates that PolluxSched is still able to find efficient cluster allocations while obeying the interference avoidance constraint.

Figure 9: Average JCT for various degrees of artificial network contention, interference avoidance enabled vs. disabled.

5.3.3 Cloud Auto-scaling

In cloud environments, computing resources can be obtained and released as required, and users pay for the duration they hold onto those resources. In this scenario, it is less likely that