Energy Analysis and Optimization for Distributed Data Centers under Electrical Load Shedding

 
Energy Analysis and Optimization for
Distributed Data Centers under Electrical
             Load Shedding
                                          by

                                 Linfeng Shen
       B.Eng., Beijing University of Posts and Telecommunications, 2019

                  Thesis Submitted in Partial Fulfillment of the
                          Requirements for the Degree of
                                  Master of Science

                                        in the
                            School of Computing Science
                            Faculty of Applied Sciences

                               © Linfeng Shen 2021
                      SIMON FRASER UNIVERSITY
                                   Summer 2021

   Copyright in this work is held by the author. Please ensure that any reproduction
    or re-use is done in accordance with the relevant national copyright legislation.
Declaration of Committee

Name:           Linfeng Shen

Degree:         Master of Science

Thesis title:   Energy Analysis and Optimization for Distributed
                Data Centers under Electrical Load Shedding

Committee:      Chair: Ouldooz Baghban Karimi
                       Lecturer, School of Computing Science

                Jiangchuan Liu
                Senior Supervisor
                Professor, School of Computing Science

                Qianping Gu
                Supervisor
                Professor, School of Computing Science

                Jie Liang
                Examiner
                Professor, School of Engineering Science

Abstract

The number and scale of data centers have increased significantly in today's digital
world. Distributed data centers are emerging as a promising solution, driven by modern
applications that demand massive computation resources and strict response times. Their
reliability and availability depend heavily on the electrical power supply. Most data
centers are equipped with battery groups as backup power in case of electrical load
shedding or power outages caused by severe weather or human factors. The limited number
and gradual degradation of batteries, however, can hardly support the servers to finish
all jobs on time. In this thesis, we divide the workload in data centers into web jobs
and batch jobs. We develop a battery allocation and workload migration framework that
guarantees web jobs are never interrupted while simultaneously minimizing the waiting
time of batch jobs. Our extensive evaluations show that our battery allocation and
workload migration results keep all web jobs uninterrupted and minimize the average
waiting time of batch jobs within a limited overall cost, compared to the current
practical allocation.
Keywords: Distributed data center; Load shedding; Backup battery; Workload migration

Dedication

This thesis is for my family and all the people who gave me support during my master’s
degree study.

Acknowledgements

First and foremost, I would like to express my greatest gratitude to my senior supervisor, Dr.
Jiangchuan Liu, for his constant support and insightful direction throughout my master's
study. He taught me how to grow from an inexperienced beginner into a researcher. His
enlightenment, guidance and encouragement on my research were invaluable to
the success of my study over the past two years. He also sets a role model as a researcher and
a professor for me with his excellent academic contributions and humble personality. This
certainly inspires me to continue pursuing a Ph.D. degree.
   I also thank Dr. Qianping Gu and Dr. Jie Liang for serving on my thesis examining
committee. I would also like to thank Dr. Ouldooz Baghban Karimi for chairing my thesis defense.
   Besides, I must express my gratitude to Dr. Feng Wang, Dr. Yifei Zhu, Dr. Xiaoyi Fan
and Dr. Fangxin Wang for helping me with my research as well as many other problems
during my study. I also feel lucky and grateful to have the experience of studying together
with Mr. Yutao Huang, Mr. Yuchi Chen, Mr. Jia Zhao, Mr. You Luo, Mr. Xiangxiang Wang,
Ms. Miao Zhang and other friends at Simon Fraser University.
   Finally, I would like to thank my family and girlfriend for their consistent support.
Words are powerless to express my gratitude.

Table of Contents

Declaration of Committee                                                                            ii

Abstract                                                                                           iii

Dedication                                                                                         iv

Acknowledgements                                                                                    v

Table of Contents                                                                                  vi

List of Tables                                                                                    viii

List of Figures                                                                                    ix

1 Introduction                                                                                      1
   1.1   Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         3
         1.1.1   Energy Analysis and Energy Efficient Approach . . . . . . . . . . . .              3
         1.1.2   Workload Management . . . . . . . . . . . . . . . . . . . . . . . . .              4
   1.2   Motivation     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     4
   1.3   Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .          4

2 Workload Migration under Electrical Load Shedding                                                 6
   2.1   System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         6
         2.1.1   Workload, Power and Heat . . . . . . . . . . . . . . . . . . . . . . .             7
         2.1.2   Task-Arriving and Server-Service Rates . . . . . . . . . . . . . . . .            10
         2.1.3   Workload Migration . . . . . . . . . . . . . . . . . . . . . . . . . . .          11
   2.2   Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         11
   2.3   Solutions    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    12
         2.3.1   Approximation of Average Waiting Time . . . . . . . . . . . . . . . .             12
         2.3.2   Problem Transformation . . . . . . . . . . . . . . . . . . . . . . . . .          13
         2.3.3   Algorithm Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        14

3 Battery Allocation against Electrical Load Shedding                                              18
   3.1   System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        20

3.1.1   Web Jobs and Batch Jobs . . . . . . . . . . . . . . . . . . . . . . . .         20
        3.1.2   Backup Batteries and Servers . . . . . . . . . . . . . . . . . . . . . .        21
  3.2   Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       21
  3.3   Solutions   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   23
        3.3.1   Battery Allocation Framework . . . . . . . . . . . . . . . . . . . . .          23

4 Performance Evaluation                                                                        27
  4.1   Evaluation of Workload Migration Algorithm . . . . . . . . . . . . . . . . .            27
        4.1.1   Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      27
        4.1.2   Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . .      28
  4.2   Evaluation of Battery Allocation Framework          . . . . . . . . . . . . . . . . .   31
        4.2.1   Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . .        31
        4.2.2   Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . .      32

5 Conclusion and Discussion                                                                     34

Bibliography                                                                                    35

List of Tables

 Table 2.1 Notations 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    8

 Table 3.1 Notations 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   19

 Table 4.1 Server configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . .    27

List of Figures

 Figure 1.1   Various deployment types of distributed data centers . . . . . . . .           2

 Figure 2.1   Overall system model of distributed data centers        . . . . . . . . . .    7
 Figure 2.2   Typical air cooling system in distributed DCs . . . . . . . . . . . .          9
 Figure 2.3   Data flow in the model . . . . . . . . . . . . . . . . . . . . . . . . .      10

 Figure 3.1   The backup batteries and the monitor system. . . . . . . . . . . . .          18
 Figure 3.2   Typical discharge voltage versus time characteristics . . . . . . . .         19
 Figure 3.3   Framework of our system model . . . . . . . . . . . . . . . . . . . .         20
 Figure 3.4   Framework of our solution . . . . . . . . . . . . . . . . . . . . . . .       23

 Figure 4.1   Relationship between utilization and different power constraints sit-
              uation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    28
 Figure 4.2   Response time with different tolerance parameter . . . . . . . . . .          29
 Figure 4.3   Runtime with different tolerance parameter        . . . . . . . . . . . . .   29
 Figure 4.4   Comparison of overall response time . . . . . . . . . . . . . . . . . .       30
 Figure 4.5   Comparison of overall migration cost . . . . . . . . . . . . . . . . .        31
 Figure 4.6   Various metrics for a typical data center when equipped with differ-
              ent number of battery groups . . . . . . . . . . . . . . . . . . . . .        32
 Figure 4.7   The comparison of cost and average waiting time in original alloca-
              tion and re-allocation based on our method        . . . . . . . . . . . . .   32

Chapter 1

Introduction

Recent years have witnessed the rapid development of cloud computing. It brings tremendous,
pay-as-you-go computation resources to fill the gap between the computing demands of
ever-increasing mobile devices and their limited onboard computation power [20]. Data
centers (DCs), as the core component of cloud computing, have become the brain of today's
digital world, providing storage, computation, and management for those interconnected
devices over the high-speed Internet. On the other hand, many emerging modern
applications, such as autonomous driving, VR/AR, and video analytics, require not only
a massive amount of computation resources but also ultra-low delay to guarantee fast
responses, rendering the conventional centralized DC-based task offloading ineffective. To
this end, distributed data centers stand out as a promising solution. As shown in Fig. 1.1,
distributed DCs are usually server clusters deployed at the edge of the Internet at different
scales, which handle various tasks from numerous nearby sources and provide instant
feedback with lower operation costs and delay compared to centralized data centers.
   Different from centralized DCs, which are usually well attended with sophisticated wa-
ter cooling systems [22], distributed DCs, to allow dense deployment across large regions,
often use much less backup power supply and cheaper air cooling systems to reduce
operational costs. These factors create a key challenge at the confluence of the
ever-changing workload demand at each distributed DC and the unreliable power supply
therein, particularly due to load shedding (where utilities cut back the supply when
electrical generation and transmission systems cannot meet demand), as well as planned
or accidental power outages [35], which tends to cause severe service delays or even
service interruptions. For example, the government of China has limited power in several
provinces due to the impact of weather or green-energy policy [26] [3]. Power supplies
have been cut to some industrial and commercial customers in Hunan and Jiangxi
provinces, where demand has jumped by at least 18% over the previous year. A severe
winter storm recently caused power outages for about 220,000 utility customers in Texas
[27]. Load shedding in South Africa has crippled many data centers and incurred
substantial extra costs [29]. It is estimated that a four-and-a-half-hour load shedding
event could cost a data center 100,000
              Figure 1.1: Various deployment types of distributed data centers:
              (a) Monolithic, (b) Edge data center, (c) Mega data center

rands. As such, when the power supply becomes constrained, the server utilization of a DC
will be affected accordingly, usually causing the scale-down of the task processing capacity.
Yet if a crowd of tasks happens to arrive during this period, it can easily lead to ineffective
resource usage and delayed task responses.
     To avoid service interruptions, most data centers are equipped with energy-storage
battery groups that serve as backup power during electrical load shedding or power
outages. On the other hand, workload migration has been proposed as an effective solution
to maximize resource efficiency by scheduling tasks from congested DCs to lightly loaded
ones. Much effort has thus been devoted to analyzing the resource and workload efficiency
of data centers [2] [39] [31] [32], and pioneering works have explored many solutions with
such considerations as different data center capacities, migration cost, resource
efficiency and electricity price [17] [37] [40] [28] [25]. Task offloading to the remote
cloud has also been examined in academia [6] [12] [5], and recently deep learning has
been applied to predict the task failure rate therein [15].
     In this thesis, we carefully consider the impacts of electrical load shedding on
distributed DCs, especially on their server utilization as well as the resulting
ineffective resource usage and degraded QoS, and we further propose workload migration to
effectively tackle the issues caused by such power constraints. To achieve this, we first
examine the relationship between the power constraints and the corresponding server
utilization by fully considering the cooling requirements of real-world distributed data
centers. We then develop a queuing model for the task service and formulate an
optimization problem for workload migration, aiming to minimize both the service response
time and the migration cost. We further propose an effective algorithm to solve the
problem, whose worst-case result is no more than 3D/t greater than the optimal, where t
is the parameter for the approximation accuracy and D is the number of distributed DCs.
We have conducted extensive simulations to evaluate our solution. The results demonstrate
that when the power supply becomes constrained, our method can significantly improve the
overall response time, with over 9% reduction

with minimal cost. We then consider distributed DCs with backup batteries. We divide
the workload of distributed data centers into web jobs, which cannot be interrupted, and
batch jobs, which can be delayed. To better improve the Quality of Service (QoS) of data
centers under electrical load shedding or power outages, we propose a new battery
allocation framework that first guarantees web jobs are not interrupted and then
minimizes the average waiting time of batch jobs within a limited overall cost. We use
the workload migration algorithm to solve the problem. We have conducted extensive
experiments based on real-world data to evaluate our framework. The results demonstrate
that our battery allocation framework and workload migration algorithm can significantly
improve the QoS of distributed data centers against electrical load shedding within a
limited overall cost.

1.1     Related Work
In this section, we introduce some recent works related to our research, including energy
system analysis and workload management of data centers.

1.1.1   Energy Analysis and Energy Efficient Approach
There are many works on energy system analysis, and many approaches have been proposed
to save energy in data centers. Dayarathna et al. [9] surveyed the state-of-the-art
techniques used for energy consumption modeling and prediction for data centers and their
components. They performed a layer-wise decomposition of a data center system's power
hierarchy, first dividing the components into two types, hardware and software, and then
analyzed current power models at different layers of the data center system in a
bottom-up fashion. Chu et al. [7] investigated a typical small container data center with
an overhead air supply system to improve the power usage effectiveness (PUE) of data
centers, studying the effects of the air supply flowrate of the computer room air handler
(CRAH), the intake flowrate of the rack cooling fans, the air supply grilles, and the
heat load distribution on the cooling performance. Thompson et al. [34] presented a
methodology for optimizing investment in
data center battery storage capacity. Haywood et al. [18] addressed significant data center
issues of growth in numbers of computer servers and subsequent electricity expenditure by
proposing, analyzing and testing a unique idea of recycling the highest quality waste heat
generated by data center servers. Li et al. [24] addressed the problem of improving the
utilization of renewable energy for a single data center using two approaches:
opportunistic scheduling and energy storage. They found an intermediate solution mixing
both approaches to achieve a balance in the energy losses due to different causes, such
as battery efficiency and the VM migrations triggered by consolidation algorithms.

1.1.2    Workload Management
Workload management is also a long-standing topic for data centers and has attracted many
efforts. Zapater et al. [38] presented a model for close-coupled data centers with free
cooling, and proposed a technique that jointly allocates workload and controls cooling in
a power-efficient way based on this model. Guo et al. [17] made a thorough analysis of the
newly released dataset from Alibaba's production cluster, breaking it down into resource
allocation and resource adjustment to better understand the resource efficiency and the
characteristics of workloads in data centers. Forestiero et al. [14] presented a
hierarchical approach for workload management in geographically distributed data centers,
based on a function that defines the cost of running some workload on the various sites
of the distributed data center. Iqbal et al. [21] proposed a secure fog computing paradigm
where roadside units (RSUs) are used to offload tasks to nearby fog vehicles based on
repute scores maintained in a distributed blockchain ledger; their experimental results
demonstrate a significant performance gain in terms of queuing time, end-to-end delay,
and task completion rate compared to a baseline queuing-based task offloading scheme.
Aljaedi et al. [1] introduced the design and implementation of OTMEN, a scalable traffic
monitoring system for SDN-enabled data center networks. OTMEN decouples the forwarding
and monitoring configurations in the data plane to relieve the controller, while allowing
fine-grained flow-level monitoring at the edge switches; the evaluation results show that
OTMEN provides significant improvements and monitoring-overhead reduction compared to
existing solutions.

1.2     Motivation
All the previous works have constructed comprehensive models for data centers and
proposed many approaches to save energy or improve efficiency. However, all of them are
based on the assumption that the power supply is always sufficient. To the best of our
knowledge, there is still no work on energy analysis and workload management against
electrical load shedding for distributed data centers. We closely investigate the
influence of electrical load shedding on distributed data centers and construct a
physical model to estimate the relationship among power, heat and workload. Based on this
profiling of distributed data centers under electrical load shedding, we then propose a
framework to allocate an appropriate number of battery groups against load shedding and
to migrate some workload in this situation, improving the QoS of data centers with
minimal cost.

1.3     Thesis Organization
The rest of this thesis is organized as follows:

   In Chapter 2, we present our physical model of the data center and its operating
profile under electrical load shedding. We then describe our system model without backup
batteries and propose our workload migration algorithm to improve the QoS of data centers.
   In Chapter 3, we consider a model with battery groups in data centers as backup power.
In this model, we divide the workloads of data centers into two types: web jobs and batch
jobs. We formulate our optimization problem, which aims to guarantee that web jobs are
not interrupted while minimizing the waiting time of batch jobs. We propose a new battery
allocation framework combined with the workload migration algorithm to achieve this.
   In Chapter 4, we evaluate the performance of our proposed workload migration algorithm
and battery allocation framework. Through real-world experiments and simulations, our
methods are confirmed to be efficient.
   In Chapter 5, we conclude this thesis and discuss the potential improvement as our
future work.

Chapter 2

Workload Migration under
Electrical Load Shedding

In this chapter, we first introduce the overall system model without backup batteries and
formulate the workload migration problem under a constrained power supply. We then use
queuing theory to approximate the tasks' average response time. Finally, we transform the
optimization problem into two subproblems and design an algorithm to solve them
efficiently.

2.1     System Model
Fig. 2.1 shows the overall system model of the distributed data centers considered in this
chapter. In particular, a number of distributed data centers are deployed in a region with
different scales and different power supply conditions. Normally, users send tasks to their
closest data center, and these tasks are usually scheduled and executed with the first-come-
first-serve (FCFS) policy [16]. Each data center has a single-channel queue and all the servers
in a data center are homogeneous. However, due to the instability of the power supply, tasks
may suffer from considerable delays under electrical load shedding, where a long response
time will severely affect the quality of service (QoS).
   As the data centers have different scales, and the server utilization will be
heterogeneous under different power supply situations, we first model the relationship
between the power supply and server utilization. Since the cooling system is essential in
a data center and consumes a large amount of energy, we consider both the heat flow and
the power flow in this relationship. Then we present our model for the task arrival and
server service rates to approximate the average response time. Finally, we introduce the
workload migration policy in our model and present the two objectives we care about in
the problem formulation. Table 2.1 shows the notations used in this chapter.

                Figure 2.1: Overall system model of distributed data centers

2.1.1    Workload, Power and Heat

   To estimate the server utilization under power constraints, we first construct a
physical model of the data center. A data center is an energy-intensive system: it
contains many devices that consume a great deal of energy and, in doing so, generate a
large amount of heat. To protect these devices, a cooling system is therefore essential.
Fig. 2.2 shows a typical air cooling system used in distributed data centers. The air
cooling system needs to remove the heat generated by the servers and other heat sources,
and has to be reliable and available 24/7 with built-in redundancy. Traditional data
center cooling uses infrastructure known as the computer room air conditioner (CRAC) or
computer room air handler (CRAH). The CRAC/CRAH units pressurize the space below the
raised floor and push cold air through the perforated tiles and into the server intakes.
The servers are cooled in this process, and all of it requires significant electrical
power.
   Similar to [30], we construct a model with several intermediary data flows and relation-
ships between sub-components in a data center, which is further illustrated in Fig. 2.3. In
general, there are three flows in this model, that is workload, heat and power flow. The
amount of workload will determine the power consumption of the servers. The heat gener-
ated from the servers and the outside temperature will jointly affect the power consumption
of the cooling system. Then the combination of server, cooler and other parts in the data

Table 2.1: Notations 1

           P      power consumption of a server
           P′     power consumption of a server under electrical load shedding
           C      power consumption of the cooling system
           C′     power consumption of the cooling system under electrical load shedding
           O      power consumption of other components
           S      total power consumption
           Q      the amount of generated heat
           α      the percentage of power constraints
           T      temperature
           u      percentage of utilization
           A      the area of the server surface
           h      the heat transfer coefficient
           V      the velocity of airflow
           D      the set of all the data centers
           s_i    the number of servers in data center i
           λ_i    the task-arriving rate in data center i
           λ′_i   the task-arriving rate in data center i after workload migration
           µ_i    the server-service rate in data center i
           µ′_i   the server-service rate in data center i under electrical load shedding
           P_0i   the probability that no task is in the queue in data center i
           W_i    the average waiting time in data center i
           W′_i   the average waiting time in data center i after workload migration
           U_i    the maximum upload speed in data center i
           L_i    the maximum download speed in data center i
           c_ij   the migration cost per unit from data center i to j
           X_ij   the volume of workload migrated from data center i to j
           ρ_i    the utilization of the servers in data center i
           t      the accuracy of approximation in the penalty function

center constitutes the total power consumption. The concrete relationships among these
parts are as follows:

   The work in [11] shows that a server’s power consumption has a linear relationship with
its utilization as shown below:

                                    P = P_idle + (P_peak − P_idle)u                          (2.1)

where P is the power consumption of the server and u is the percentage of its utilization.
P_idle and P_peak are the idle and peak-load power consumption of the server. We use Q
to represent the heat generated by the server, which is also roughly the heat that should
be removed by the cooling system. Then we have the following convective heat transfer
equation:

Figure 2.2: Typical air cooling system in distributed DCs

                                   Q = hA(T_outside − T_inside)                              (2.2)

where A is the area of the server surface, h is the heat transfer coefficient, and
T_outside and T_inside are the temperatures of the environment outside and inside the
data center. From [10], we know h is exponentially related to the velocity V of the flow,
and the exponent depends on the type of flow. The forced convection in the server is
typically turbulent, which makes the exponent 4/5. We thus have:

                                             h ∝ V^(4/5)                                     (2.3)

   To remove more heat generated by the server, the velocity of airflow should increase at
an even greater rate. The fan laws [23] tell us that this increase is first order, but power
consumption increases to the 3rd power:

                                             V ∝ C^3                                         (2.4)

where C is the power consumption of the cooling system. Combining Eqs. (2.2), (2.3) and
(2.4), we can get the relationship between the power consumption of the cooling system C
and the generated heat Q:

                                            C ∝ Q^(5/12)                                     (2.5)
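The exponent in Eq. (2.5) follows by chaining the relations above, taking them exactly as stated; a short derivation:

```latex
Q \propto h \propto V^{4/5}, \qquad V \propto C^{3}
\;\Longrightarrow\; Q \propto \bigl(C^{3}\bigr)^{4/5} = C^{12/5}
\;\Longrightarrow\; C \propto Q^{5/12}.
```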

   Next we consider the data center's server utilization under power constraints, where
the power supply relationship can be represented as:

                                       P′ + C′ + O ≤ αS                                      (2.6)

                             Figure 2.3: Data flow in the model

where α is the percentage of the power constraint and S is the total power consumption. We can easily get the total server power consumption P' from Eq. (2.1), and the total power consumption of other components O will not change under power constraints. The only remaining challenge is then to get the total power consumption of the cooling system C'. From Eq. (2.5) we know that the power consumption of the cooling system is related to the generated heat. Based on this, we can estimate C' if we know the predicted heat under a specific situation. Let the total amount of generated heat be Q'total and Qtotal with and without power constraints, respectively, and let β denote the ratio between them; then we have:

                                 $Q'_{total} = \beta Q_{total}$                                (2.7)

   Combining with Eq. (2.5), we can get the relationship between C'total and Ctotal:

                               $C'_{total} = \beta^{5/12} C_{total}$                           (2.8)

   It is easy to get Ctotal and β in the general situation. We can then obtain C'total under power constraints and calculate the server utilization based on Eq. (2.1) and (2.6).
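As a minimal illustrative sketch (function and variable names are ours, not from the thesis), the scaling of Eq. (2.8) and the supply check of Eq. (2.6) can be computed directly:

```python
# Illustrative helpers for the cooling-power model; the 5/12 exponent is
# from Eq. (2.5)/(2.8) and the budget check is Eq. (2.6).  Names are ours.
def cooling_power_constrained(c_total, beta):
    """C'_total = beta^(5/12) * C_total  (Eq. 2.8)."""
    return (beta ** (5.0 / 12.0)) * c_total

def power_budget_ok(p_server, c_cooling, p_other, alpha, s_total):
    """Supply constraint P' + C' + O <= alpha * S  (Eq. 2.6)."""
    return p_server + c_cooling + p_other <= alpha * s_total
```

With β < 1 (less heat generated under constraints), the estimated cooling power drops accordingly.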

2.1.2    Task-Arriving and Server-Service Rates

   Let λi denote the task-arriving rate in data center i, and µi denote the mean rate at
which tasks are finished by a single server in the data center i, which is directly proportional
to server utilization. The number of servers in the data center i is represented by si .

Let Wi denote the average waiting time of tasks in data center i which is determined by
the task-arriving rate λi , server-service rate µi , and the number of servers si . Moreover, if µi
and si do not change, the average waiting time should decrease when λi decreases. That is
the reason why power constraints can affect the response time of tasks. In the next section,
we will propose an approach using queueing theory to approximate this average waiting time.

2.1.3    Workload Migration

   Our model assumes that all data centers can connect to each other and migrate some
workloads to others. When the workload migration happens between two data centers, both
of their task-arriving rates will change accordingly. In particular, if data center i migrates
some workload to data center j, then its task-arriving rate decreases from λi to (λi − ∆λij ).
The task-arriving rate of data center j will then increase from λj to (λj + ∆λij ).
   Due to limits on device capacity and network conditions, the migration speed is constrained in all data centers, which means one data center can only receive or send a limited amount of workload at a time. We use Ui and Li to denote the maximum upload and download speeds allowed for workload migration in data center i. The migration also incurs some cost, and we use cij to denote the cost of migrating one unit of workload from data center i to j.

2.2     Problem Formulation
If there are power outages or constraints in some distributed data centers, the original task schedule will suffer from low QoS. Tasks in some data centers will have overly long response times while others may still have spare computing resources. To obtain better QoS and lower response times, the original workloads may need to be rescheduled, with some being migrated to other distributed data centers. Let Xij denote the volume of workload migrated from data center i to j. If Xij is negative, data center i receives workload from j. Let D denote the number of distributed data centers considered in a region and λi denote the original task-arriving rate; then we can get the new task-arriving rate λi' in data center i as below:

                        $\lambda_i' = \lambda_i - \sum_{j=1}^{D} X_{ij}$                       (2.9)
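Eq. (2.9) is a simple update of the arrival rates given a migration matrix X. A small illustrative sketch (names are ours):

```python
# Eq. (2.9): update arrival rates given the migration matrix X, where
# X[i][j] > 0 means data center i sends workload to j, and the matrix
# is antisymmetric (X[i][j] = -X[j][i]).  Names are ours.
def new_arrival_rates(lam, X):
    return [lam[i] - sum(X[i]) for i in range(len(lam))]
```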

   Our objective is to minimize the response time of all the tasks and minimize the mi-
gration cost simultaneously. Let Wi 0 denote the new average waiting time after workload
migration. The problem can be formulated as

                  $\text{Min: } Response = \sum_{i=1}^{D} \lambda_i' W_i'$                    (2.10)

          $\text{Min: } Cost = \sum_{i=1}^{D} \sum_{j=1}^{D} \max\{X_{ij}, 0\}\, c_{ij}$      (2.11)

   s.t.

                              $X_{ij} = -X_{ji}, \forall i, j$                                (2.12)

                         $\frac{\lambda_i'}{s_i \mu_i} < 1, \forall i$                        (2.13)

                    $-L_i \le \sum_{j=1}^{D} X_{ij} \le U_i, \forall i$                       (2.14)

   Here, constraint (2.13) ensures the model is stable, i.e., the queue of tasks will not grow infinitely. Constraint (2.14) ensures the migration limits are not exceeded.
In the next section, we will introduce our solution to approximate the average waiting time
and get the best migration schedule.

2.3       Solutions
In this section, we first use queuing theory to approximate the tasks’ average response time.
Then we transform the optimization problem into two subproblems and design an algorithm
to solve them efficiently.

2.3.1     Approximation of Average Waiting Time
Assume the task interarrival times and service times are both exponentially distributed [37] [36]. All the tasks in a data center can then be modeled as an M/M/c queue, and the utilization of the servers in data center i is:

                              $\rho_i = \frac{\lambda_i}{s_i \mu_i}$                          (2.15)

where ρi also represents the probability that a server is busy, or the proportion of time that a server is busy. According to M/M/c queue theory, the probability that no task is in the system is

      $P_{0i} = \left[ \sum_{n=0}^{s_i-1} \frac{(s_i\rho_i)^n}{n!} + \frac{(s_i\rho_i)^{s_i}}{s_i!(1-\rho_i)} \right]^{-1}$      (2.16)

   Therefore, the average waiting time of tasks in data center i can be calculated as

      $W_i = \frac{P_{0i}\,\rho_i\,\lambda_i^{s_i-1} + s_i!\,(1-\rho_i)^2\,\mu_i^{s_i-1}}{s_i!\,(1-\rho_i)^2\,\mu_i^{s_i}}$      (2.17)
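Eqs. (2.15)-(2.17) can be evaluated directly. Below is a small illustrative implementation of the M/M/c quantities (names are ours); for s_i = 1 it reduces to the familiar M/M/1 result W = 1/(µ − λ):

```python
from math import factorial

def erlang_p0(lam, mu, s):
    """P0 for an M/M/s queue (Eq. 2.16): probability of an empty system."""
    rho = lam / (s * mu)
    assert rho < 1, "queue must be stable (constraint 2.13)"
    a = lam / mu                      # offered load, equal to s * rho
    total = sum(a**n / factorial(n) for n in range(s))
    total += a**s / (factorial(s) * (1 - rho))
    return 1.0 / total

def avg_waiting_time(lam, mu, s):
    """Average time in system W_i (Eq. 2.17): queueing delay plus service."""
    rho = lam / (s * mu)
    p0 = erlang_p0(lam, mu, s)
    wq = p0 * (lam / mu)**s * rho / (factorial(s) * (1 - rho)**2 * lam)
    return wq + 1.0 / mu
```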

2.3.2    Problem Transformation
Recall that our objectives are minimizing both the response time, Eq. (2.10), and the overall migration cost, Eq. (2.11). This is a multi-objective programming problem in which the migration variable Xij must lie in a specific range defined by the constraints, and there is no known efficient solution for it. We therefore transform it into two subproblems and design an algorithm to solve them efficiently.
   The two objectives in our formulated problem are contradictory and thus cannot be optimized simultaneously. We first optimize the response time because, for providers, guaranteeing the QoS of users is usually the most important. Since the overall response time is determined only by the new task-arriving rate λ' and does not depend on how the workload is migrated, all the constraints can be rewritten without the variable Xij. Let λi' become the new variable instead of Xij; with the same objective (2.10), the new constraints can be formulated as:

                   $\sum_{i=1}^{D} \lambda_i' = \sum_{i=1}^{D} \lambda_i$                     (2.18)

                $\lambda_i - U_i \le \lambda_i' \le \lambda_i + L_i, \forall i$               (2.19)

                           $\lambda_i' < s_i \mu_i, \forall i$                                (2.20)

   To solve this new problem, we start from the barrier method [4]. We first introduce a punishment function Î(u) = −(1/t)log(−u) to remove the inequality constraints, where t > 0 is a parameter that sets the accuracy of the approximation. Î(u) is convex, nondecreasing, and differentiable, which makes it amenable to Newton's method. We replace u in the punishment function with the constraints of Eq. (2.19) and Eq. (2.20) and add the resulting terms to the objective. The basic idea is that the punishment function is close to zero when a constraint is satisfied and very large when it is violated, so the solution of the new objective approaches that of the original problem as t increases. The problem now becomes

      $\text{Min: } f(\vec{\lambda}') = \sum_{i=1}^{D} \{\, t\lambda_i' W_i - \log(\lambda_i' - \lambda_i + U_i) - \log(\lambda_i + L_i - \lambda_i') - \log(s_i\mu_i - \lambda_i') \,\}$      (2.21)

s.t. Eq. (2.18).
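The barrier objective of Eq. (2.21) can be written down directly. A minimal illustrative sketch (the waiting-time function is passed in as a callback, since W_i itself comes from Eq. (2.17); all names are ours):

```python
import math

def barrier_objective(lam_new, lam, U, L, smu, t, waiting_time):
    """Eq. (2.21): t * sum(lam'_i * W_i) plus log-barrier terms for the
    inequality constraints (2.19)-(2.20).  waiting_time(i, x) is assumed
    to return W_i for candidate arrival rate x.  Illustrative only."""
    total = 0.0
    for i, x in enumerate(lam_new):
        # infeasible points get an infinite barrier value
        if not (lam[i] - U[i] < x < lam[i] + L[i] and x < smu[i]):
            return math.inf
        total += t * x * waiting_time(i, x)
        total -= math.log(x - lam[i] + U[i])
        total -= math.log(lam[i] + L[i] - x)
        total -= math.log(smu[i] - x)
    return total
```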

2.3.3    Algorithm Design
We have transformed the original problem into an equality-constrained minimization problem, and we now develop a Newton's-method-based algorithm [4] to solve it. Unlike in the classical Newton's method, the start point must be feasible for this problem, and the calculation of the Newton step must take the equality constraint into account. Here we can use the original task-arriving rate λ⃗ = {λ1, λ2, ..., λD} as the start point since it must be feasible. At this feasible point λ⃗, we replace the objective with its second-order Taylor approximation near λ⃗; the objective becomes

      $\text{Min: } \hat{f}(\vec{\lambda} + \Delta\vec{\lambda}) = f(\vec{\lambda}) + f'(\vec{\lambda})^T \Delta\vec{\lambda} + \frac{1}{2}\Delta\vec{\lambda}^T f''(\vec{\lambda})\Delta\vec{\lambda}$      (2.22)
where Δλ⃗ denotes the Newton step. This is now a quadratic minimization problem with equality constraints and can be solved analytically. The Newton step Δλ⃗ is what must be added to λ⃗ to solve the problem, and it can be calculated from the KKT system for the equality-constrained quadratic optimization problem [4]. The optimality conditions are

      $\sum_{i=1}^{D} \Delta\lambda_i = 0, \qquad f''(\vec{\lambda})\Delta\vec{\lambda} + f'(\vec{\lambda}) + \sum_{i=1}^{D} \hat{\lambda}_i = 0$      (2.23)

where $\hat{\vec{\lambda}}$ is the associated optimal dual variable for the quadratic problem. Rewriting Eq. (2.23) in matrix form:

      $\begin{bmatrix} f''(\vec{\lambda}) & I \\ I & 0 \end{bmatrix} \begin{bmatrix} \Delta\vec{\lambda} \\ \hat{\vec{\lambda}} \end{bmatrix} = \begin{bmatrix} -f'(\vec{\lambda}) \\ 0 \end{bmatrix}$      (2.24)

where I is the identity matrix. We can easily get that $f''(\vec{\lambda}) = \mathrm{diag}(1/\lambda_1^2, 1/\lambda_2^2, ..., 1/\lambda_D^2)$ and $f'(\vec{\lambda}) = -(1/\lambda_1, 1/\lambda_2, ..., 1/\lambda_D)$ from Eq. (2.21). The Newton step is defined only at points for which the KKT matrix is nonsingular. When the KKT matrix is nonsingular, there is a unique optimal primal-dual pair $(\vec{\lambda}, \hat{\vec{\lambda}})$. If the KKT matrix is singular but the KKT system is solvable, then any solution yields an optimal pair $(\vec{\lambda}, \hat{\vec{\lambda}})$. If the KKT system is not solvable, the problem is unbounded below or infeasible. When the KKT system is solvable, we can solve it efficiently by elimination, i.e., by solving:

      $I f''(\vec{\lambda})^{-1} I^T \hat{\vec{\lambda}} = -I f''(\vec{\lambda})^{-1} f'(\vec{\lambda})$      (2.25)

and setting $\Delta\vec{\lambda} = -f''(\vec{\lambda})^{-1}(I^T \hat{\vec{\lambda}} + f'(\vec{\lambda}))$, or in other words,

      $\Delta\vec{\lambda} = -\mathrm{diag}(\vec{\lambda})^2 I^T \hat{\vec{\lambda}} + \vec{\lambda}$      (2.26)

where $\hat{\vec{\lambda}}$ is the solution of

      $I\,\mathrm{diag}(\vec{\lambda})^2 I^T \hat{\vec{\lambda}} = I\vec{\lambda}$      (2.27)
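The KKT system of Eq. (2.24) can also be solved numerically. The sketch below assembles and solves it for the single equality constraint ΣΔλi = 0, written in the standard form with a row vector of ones (the matrix the thesis denotes by I); names are ours and the code is illustrative, not the thesis's implementation:

```python
import numpy as np

def newton_step(hess_diag, grad):
    """Solve the equality-constrained KKT system of Eq. (2.24):
    hess_diag is the diagonal of f''(lam), grad is f'(lam), and the
    constraint sum(dlam) = 0 is encoded by a row of ones."""
    d = len(grad)
    ones = np.ones((1, d))
    kkt = np.block([[np.diag(hess_diag), ones.T],
                    [ones, np.zeros((1, 1))]])
    rhs = np.concatenate([-np.asarray(grad, dtype=float), [0.0]])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:d], sol[d]          # Newton step, dual variable
```

The returned step always satisfies the equality constraint, so each Newton update stays on the feasible set.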

   After we get Δλ⃗, we can use λ⃗ + Δλ⃗ to approximate the optimal solution for the current t. When the objective f is exactly quadratic, the update λ⃗ + Δλ⃗ exactly solves the minimization problem. When f is nearly quadratic, λ⃗ + Δλ⃗ should be a very accurate estimate of the optimal solution. For each t, we calculate the Newton step repeatedly until the gap to the optimal solution is smaller than the tolerance parameter ε. As t increases, λ⃗ + Δλ⃗ gets closer and closer to the optimal solution. Let λ⃗*(t) denote the solution λ⃗' we get with parameter t, and p⃗ denote the optimal solution of objective (2.10). We now show the performance bound of our algorithm.
   Theorem 1. For a specific t, the maximum gap to the optimal solution is 3D/t, i.e., $f(\vec{\lambda}^*(t)) - f(\vec{p}) \le \frac{3D}{t}$.

   Proof: Let f0 denote the original Objective (2.10). For Objective (2.22), the optimality condition is

      $t f_0'(\vec{\lambda}') + \sum_{i=1}^{D} \left\{ \frac{1}{s_i\mu_i - \lambda_i'} - \frac{1}{\lambda_i' - \lambda_i + U_i} - \frac{1}{\lambda_i + L_i - \lambda_i'} + \hat{\lambda}_i \right\} = 0$

   Divide the whole equation by t and let

      $\alpha(t) = \sum_{i=1}^{D} \left\{ \frac{1}{t(s_i\mu_i - \lambda_i')} - \frac{1}{t(\lambda_i' - \lambda_i + U_i)} - \frac{1}{t(\lambda_i + L_i - \lambda_i')} \right\}$

      $\beta(t) = \sum_{i=1}^{D} \frac{\hat{\lambda}_i}{t}$

   According to the duality theorem of optimization [4], α(t) and β(t) form a dual feasible pair. The duality gap associated with λ⃗' and this dual feasible pair is m/t, where m is the number of inequality constraints; i.e., λ⃗' is no more than 3D/t-suboptimal. ∎
    Then to minimize the migration cost based on the solution ~λ∗ (t), we transform the
problem into a minimum cost flow problem according to the following transformation rules:
    1) For each data center i that needs to send workload, add a source node vsi with
demand (λi 0 − λi ).
    2) For each data center j that needs to receive workload, add a sink node vrj with
demand (λj 0 − λj ).
    3) For each pair of vsi and vrj , add an edge (vsi , vrj ). Set its cost to cij , and its capacity
to min(Ui , Lj ).
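The thesis solves the resulting minimum cost flow with a capacity scaling algorithm [13]. As a self-contained illustration, the sketch below builds the same bipartite network of rules 1)-3) but solves it with a simple successive-shortest-path solver instead (names are ours; this is an illustrative solver, not the thesis's algorithm):

```python
def min_cost_transport(supply, demand, cost, cap):
    """Min-cost flow on the bipartite network of rules 1)-3):
    supply[i] units leave source vs_i, demand[j] units enter sink vr_j,
    and edge (vs_i, vr_j) has cost[i][j] and capacity cap[i][j]."""
    n, m = len(supply), len(demand)
    N = n + m + 2                   # node 0: super source, N-1: super sink
    S, T = 0, N - 1
    graph = [[] for _ in range(N)]  # edge: [to, residual cap, cost, rev idx]

    def add_edge(u, v, c, w):
        graph[u].append([v, c, w, len(graph[v])])
        graph[v].append([u, 0, -w, len(graph[u]) - 1])

    for i in range(n):
        add_edge(S, 1 + i, supply[i], 0)
    for j in range(m):
        add_edge(1 + n + j, T, demand[j], 0)
    for i in range(n):
        for j in range(m):
            add_edge(1 + i, 1 + n + j, cap[i][j], cost[i][j])

    total_cost = 0
    while True:
        # Bellman-Ford shortest path in the residual graph (handles the
        # negative-cost reverse edges created by earlier augmentations)
        dist = [float('inf')] * N
        dist[S], prev = 0, [None] * N
        for _ in range(N - 1):
            for u in range(N):
                if dist[u] == float('inf'):
                    continue
                for k, (v, c, w, _) in enumerate(graph[u]):
                    if c > 0 and dist[u] + w < dist[v]:
                        dist[v], prev[v] = dist[u] + w, (u, k)
        if dist[T] == float('inf'):
            break                   # all supply has been routed
        # bottleneck capacity along the path, then push flow
        push, v = float('inf'), T
        while v != S:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = T
        while v != S:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total_cost += push * dist[T]

    # recover X[i][j] from the used capacity of edge (vs_i, vr_j)
    X = [[cap[i][j] - graph[1 + i][1 + j][1] for j in range(m)]
         for i in range(n)]
    return X, total_cost
```

Successive shortest paths and capacity scaling reach the same optimum; capacity scaling is simply faster on large networks.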

   After applying the above rules, objective (2.11) is transformed into a minimum cost flow problem, which can be solved by a capacity scaling algorithm [13] in polynomial time. The pseudo-code of the whole algorithm is shown in Algorithm 1. The algorithm takes the power-constraint situation and the server utilization of all data centers as input and returns the migration solution, the overall response time, and the total migration cost. Line 1 gets the new service rate µ' based on the power constraints. Line 2 introduces the punishment function Î(u) and transforms Objective (2.10) into (2.21). Line 3 initializes a feasible start point λ⃗, an accuracy parameter t for the punishment function, an increase parameter α for t, and a tolerance parameter ε. Lines 4-8 repeatedly calculate the Newton step and update the accuracy parameter t until the gap is smaller than the tolerance parameter ε. Line 9 creates a graph G based on the solution λ⃗*(t) and the transformation rules. Line 10 calculates the minimum cost flow on graph G using the capacity scaling algorithm and obtains the migration solution X.

 Algorithm 1: Workload Migration
   Input : Power constraints and server utilization of all data centers.
   Output: Migration solution X, response time and total migration cost.
 1 Get the new service rate µ' based on the power constraints condition;
 2 Introduce the punishment function Î(u) and transform Objective (2.10) to (2.21);
 3 Initialize a feasible λ⃗, t > 0, α > 1, tolerance ε > 0;
 4 while 3D/t ≥ ε do
 5     Starting at λ⃗, get Δλ⃗ by solving the KKT system;
 6     Update λ⃗ := λ⃗ + Δλ⃗;
 7     Increase t := αt;
 8 end
 9 Create a graph G based on the solution λ⃗*(t) and the transformation rules;
10 Get the minimum cost flow on graph G using the capacity scaling algorithm;
11 return The migration solution X, Response and Cost;

   The choices of the parameters in Algorithm 1 are not trivial. First, the choice of the parameter α involves a trade-off in the number of inner and outer iterations required. If α is too small (near 1), then t increases by only a very small factor at each outer iteration. We then expect a small number of Newton steps per outer iteration, and the algorithm converges along a path near the optimal one; but more outer iterations are needed, because the gap to the optimal solution is reduced by only a small amount each time. On the other hand, we face the opposite situation when α is large. During each outer iteration t increases a lot, so the current solution is probably not a good approximation for the next iteration, and we need more inner iterations to calculate the Newton step. But this 'aggressive' updating of t can decrease the number of outer iterations, since the gap to the optimal solution shrinks more in each step. Both practice and theory [4] indicate that values of α from around 10 to 20 work well.
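Since t grows geometrically (t := αt) and Theorem 1 bounds the gap by 3D/t, the number of outer iterations for a given α and ε can be counted directly. A small illustrative sketch (names are ours):

```python
import math

def outer_iterations(D, t0, alpha, eps):
    """Outer iterations of Algorithm 1: t starts at t0 and is multiplied
    by alpha until the duality-gap bound 3*D/t (Theorem 1) drops below
    the tolerance eps."""
    if 3 * D / t0 <= eps:
        return 0
    return math.ceil(math.log(3 * D / (eps * t0)) / math.log(alpha))
```

For example, with D = 10, t0 = 1, and ε = 10⁻³, moving α from 10 to 20 drops the outer-iteration count from 5 to 4, matching the trade-off discussed above.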

   Another very important parameter is the tolerance ε. A small ε gives accurate results but significantly increases the runtime of the algorithm, since more iterations are needed to converge to the required gap. On the other hand, a very large ε can lead to inaccurate results and seriously damage the QoS in our problem. Thus we need to carefully select ε based on the real situation and the priority of our demands. Finally, a good start point λ⃗ would certainly save runtime; however, in this problem we only have one feasible start point to choose from, namely the initial situation of the data centers.

Chapter 3

Battery Allocation against
Electrical Load Shedding

In this chapter, we first introduce the overall system model with backup batteries. Fig. 3.1 shows typical lead-acid battery groups and the monitoring system. When there is electrical load shedding or a power outage, they support the servers in a data center as backup power. Yet the capacity of a battery group is limited, and a deep discharge of an energy-storage battery group (typically lead-acid) severely affects its internal structure, reducing its capacity and lifetime. Fig. 3.2 shows a typical discharging curve for a lead-acid battery. Most of its power is released during the linear region. In the final hyperbolic region, the voltage falls very fast and the battery can only release a very small fraction of its power, which should be avoided. We then formulate battery allocation and workload migration against load shedding as a multi-objective optimization problem. Table 3.1 lists the notations used in this chapter.

                    (a) Battery groups                  (b) Monitor system

                   Figure 3.1: The backup batteries and the monitor system.

[Figure 3.2 plots voltage (V) versus time (h) for a typical lead-acid battery discharge: a brief coup de fouet region, a long linear region in which most of the power is released, and a final hyperbolic region in which the voltage drops sharply.]

          Figure 3.2: Typical discharge voltage versus time characteristics

                                 Table 3.1: Notations 2

D            the set of all the data centers
si           the number of opening servers in data center i
ni           the number of battery groups in data center i
Pb           the amount of power supplied by one backup battery group
Ps           the amount of power required by one server
Pg           the amount of power supplied by the power grid
λw,i         the arriving rate of web jobs in data center i
λb,i         the arriving rate of batch jobs in data center i
µi           the server-service rate in data center i
P0i          the probability that no task is in the queue in data center i
Wi           the average waiting time in data center i
r^t_{i,ns}   the reserve time for data center i with ns battery groups at time t
cb           the replacement cost of a single battery group
Ti,ni        the lifetime of batteries in data center i
Cr           the normalized battery replacement cost
Cm           the migration cost of all batch jobs
Call         the normalized overall cost
Ψ            the average waiting time of batch jobs
Ui           the maximum upload speed in data center i
Li           the maximum download speed in data center i
cij          the migration cost per unit from data center i to j
Xij          the volume of workload migrated from data center i to j
nL           lower limit of the battery group number in a data center
nU           upper limit of the battery group number in a data center
B            the budget limit
ρi           the utilization of the servers in data center i
t            the accuracy of approximation in the punishment function

[Figure 3.3 depicts the framework: users submit web jobs and batch jobs to distributed data centers; each data center is powered by the power grid and backup batteries, and batch-job workload can be migrated between data centers during a power outage.]

                          Figure 3.3: Framework of our system model
3.1      System Model
Fig. 3.3 shows the overall system model of the distributed data centers considered in this chapter. In particular, we use D to denote the set of all distributed data centers in a region. Under electrical load shedding, each data center is powered by both the power grid and backup batteries. When there is a power outage and the power grid is completely unavailable, the data centers are powered only by backup batteries. Similar to [24], we divide the workload of distributed data centers into two kinds of computing jobs: web jobs, which must run continuously, and batch jobs, which can be delayed and interrupted. Web jobs cannot be interrupted, so the backup batteries must guarantee that web jobs are not affected under electrical load shedding. The batch jobs, in contrast, have a deadline constraint specified by the users; this part of the workload can be migrated to other distributed data centers with spare computing resources so as to meet the deadline. Next, we introduce the details and give the notations of each part.

3.1.1     Web Jobs and Batch Jobs

   Workload in each distributed data center can be divided into web jobs and batch jobs. Let λtw,i and λtb,i denote the arriving rates of web jobs and batch jobs in data center i at time t. Web jobs must run continuously and cannot be interrupted (like web servers); they must be finished at the current data center and cannot be migrated to other data centers. Distributed data centers have to close some servers under electrical load shedding, so we need to allocate more backup batteries to support more opening servers if the web jobs risk being interrupted under electrical load shedding. Let sti denote the number of opening servers in data center i at time t.
   All the other workload is defined as batch jobs that can be delayed and interrupted (like monthly payroll computation). Different from web jobs, if the current data center cannot open enough servers to finish the batch jobs under electrical load shedding, we can migrate some of them to other data centers to be finished later. Let Wi denote the average waiting time of batch jobs in data center i, which is determined by the arriving rate λi, server-service rate µi, and the number of servers si.

3.1.2    Backup Batteries and Servers

   Let ni denote the number of battery groups in data center i and Pb denote the amount of power supplied by one battery group. Assume all the servers in each data center are homogeneous and each server needs Ps power to run. Under electrical load shedding, assume the power grid can only supply Pgt power at time t. The relationship between the number of opening servers and backup batteries in data center i can be represented as:

                         $s_i^t P_s \le n_i P_b + P_g^t, \forall t$                            (3.1)
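Eq. (3.1) directly bounds the number of opening servers a given battery allocation can support. A one-line illustrative helper (names are ours):

```python
# Eq. (3.1): the largest number of opening servers s_i^t that n_i battery
# groups plus grid power P_g^t can support.  Names are ours.
def max_open_servers(n_batteries, p_battery, p_grid, p_server):
    return int((n_batteries * p_battery + p_grid) // p_server)
```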

   Recall that the battery suffers severely from deep discharge in the hyperbolic region. To protect the battery, we disconnect it when it discharges to the end of the linear region (as illustrated in Fig. 3.2). We use $r_{i,n_s}^t$ to denote the reserve time for data center i with ns battery groups at time t. When the reserve time decreases to zero, the data center can only be supplied by the power grid under load shedding. The batteries also degrade in this process and need to be replaced when their lifetime is surpassed. Let cb denote the replacement cost of a single battery group. For simplicity, we assume an average shipment and labor cost among different data centers. Then the overall battery replacement cost Cr (including purchase and installation) can be represented by:

                        $C_r = \sum_{i \in D} \frac{n_i c_b}{T_{i,n_i}}$                       (3.2)

where Ti,ni is the lifetime of the batteries in data center i. We use Ti,ni as the denominator for normalization (representing the annual replacement cost).
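Eq. (3.2) is a simple amortized sum. A small illustrative helper (names are ours, with lifetimes taken in years so the result is an annual cost):

```python
# Eq. (3.2): normalized (annual) battery replacement cost, amortizing the
# per-group cost cb over each data center's battery lifetime in years.
# Names are ours.
def replacement_cost(n, lifetime, cb):
    return sum(n[i] * cb / lifetime[i] for i in range(len(n)))
```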

3.2     Problem Formulation
Our objective is to allocate the appropriate number of battery groups in each data center
against electrical load shedding. We want to guarantee the web jobs are never interrupted
which can be represented as:

                       $\lambda_{w,i}^t \le N s_i^t, \forall t, \forall i$                     (3.3)

where N represents the computing capacity of one server.
   Then we want to minimize the waiting time of batch jobs at the same time. So our first
objective can be represented as:

               $\text{Min: } \Psi = \sum_{t \in T} \sum_{i \in D} \lambda_{b,i}^t W_i^t$       (3.4)

where T is the time-based index range of the considered period and Ψ represents the average
waiting time of batch jobs.
   Besides minimizing the waiting time of all the batch jobs, the operator of the data centers also wants to reduce the overall cost Call, which includes the battery replacement cost Cr and the workload migration cost Cm of batch jobs. Let Xij denote the volume of workload migrated from data center i to j. If Xij is negative, data center i receives workload from j. Our other objective can then be represented as follows:

      $\text{Min: } C_{all} = C_r + C_m = \sum_{i \in D} \frac{n_i c_b}{T_{i,n_i}} + \sum_{i \in D} \sum_{j \in D} \max\{X_{ij}, 0\}\, c_{ij}$      (3.5)

   s.t.

                            $X_{ij} = -X_{ji}, \forall i, \forall j$                           (3.6)

                          $\frac{\lambda_i}{s_i \mu_i} < 1, \forall i$                         (3.7)

                    $-L_i \le \sum_{j \in D} X_{ij} \le U_i, \forall i$                        (3.8)

   Here, constraint (3.7) ensures the queueing model is stable, i.e., the task queue does
not grow without bound. Constraint (3.8) ensures the migration limits are not exceeded.
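As an illustration, constraints (3.6)-(3.8) can be checked for a candidate migration matrix as in the following sketch (hypothetical function and variable names; it assumes the post-migration arrival rates λ_i are already given):

```python
def migration_feasible(X, lam, s, mu, L, U):
    """Check constraints (3.6)-(3.8) for a candidate migration matrix X,
    where X[i][j] is the workload volume moved from data center i to j.
    lam, s, mu, L, U hold lambda_i, s_i, mu_i, L_i, U_i per data center."""
    D = len(X)
    # (3.6): antisymmetry, X_ij = -X_ji
    if any(X[i][j] != -X[j][i] for i in range(D) for j in range(D)):
        return False
    # (3.7): queue stability, lambda_i / (s_i * mu_i) < 1
    if any(lam[i] >= s[i] * mu[i] for i in range(D)):
        return False
    # (3.8): net outgoing migration of i stays within [-L_i, U_i]
    return all(-L[i] <= sum(X[i]) <= U[i] for i in range(D))
```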
   In practice, there may be other factors that limit the number of battery groups equipped
in a data center. We thus use nL and nU to denote the lower limit and upper limit of the
number of battery groups, respectively, and this constraint can be represented as:

                                      n_L \leq n_i \leq n_U, \quad \forall i                                       (3.9)

   Besides, data center operators usually want to keep the overall cost within a given
upper bound B, which gives another constraint:

                                            C_{all} \leq B                                    (3.10)

[Figure: distributed DCs across a region provide DC information under electrical load shedding and battery information; battery degradation prediction feeds a multi-objective optimization that outputs the battery allocation and workload migration decisions.]
                                 Figure 3.4: Framework of our solution
3.3       Solutions
In this section, we introduce our solutions to the optimization problem formulated in
the previous section. The overall design of our solution framework is shown in Fig. 3.4.
Given a set of distributed data centers in a region, we first estimate the working conditions
of each data center under different power supply conditions. We have explored the working
conditions of distributed data centers under electrical load shedding in Chapter 2, from
which the working conditions of servers with different numbers of battery groups can be
calculated. Our previous work [35] proposed a deep learning-based battery profiling method
to predict battery degradation from historical battery activities; the reserve time and
lifetime of the battery groups can easily be derived from its predictions. Given these
results, we propose new solutions to the multi-objective optimization problem. First, we
use the Workload Migration algorithm introduced in the last chapter to compute the optimal
migration schedule of batch jobs, given a specific number of battery groups in each data
center and a bound. On top of this algorithm, we propose a new algorithm, Battery
Allocation, that allocates an appropriate number of battery groups to each data center to
optimize the objectives under all the constraints. We next introduce this battery
allocation framework.

3.3.1      Battery Allocation Framework
Given the profiling results of distributed data centers and battery features under electrical
load shedding, we next solve the battery allocation optimization problem. Recall that our
objective is first to guarantee the web jobs are not interrupted (Eq. (3.3)). Later we try
to migrate some of the batch jobs to minimize average waiting time and migration cost by

Algorithm 1. If we continue to allocate more battery groups after the web jobs are guaranteed,
then according to Eq. (3.2), more servers can stay on to support more batch jobs, resulting
in lower migration cost. However, the replacement cost of the backup batteries may increase.
Allocating backup batteries to minimize the overall cost is therefore non-trivial. Indeed,
this multi-objective optimization problem is a multi-objective integer programming problem,
where the battery group number n_i must be an integer between a lower bound n_L and an
upper bound n_U; such integer linear programming (ILP) is known to be NP-complete. We thus
design a heuristic algorithm to solve it efficiently.
   Before jumping into the algorithm design, we briefly analyze the characteristics of
this optimization problem. Intuitively, under the same external incidents, a data center
can support more servers during electrical load shedding when equipped with more battery
groups. Thus both the average waiting time of batch jobs and the migration cost decrease.
The battery replacement cost behaves differently; its evolution can be divided into two
stages. In the first stage, as the number of battery groups increases, the additional
backup power helps the data center sustain a longer period against load shedding and
reduces deep discharging of the batteries. The battery lifetime is thus prolonged, lowering
the replacement cost. In the second stage, if we continue to increase the number of battery
groups, the extra backup power becomes redundant and the replacement cost rises due to the
unavoidable battery degradation process. Therefore, depending on the conditions of the
corresponding data center, the battery replacement cost can follow three patterns: it first
decreases and then increases (both stage 1 and stage 2), it keeps decreasing (only stage 1),
or it keeps increasing (only stage 2). As the sum of the battery replacement cost and the
migration cost, the overall cost C_all can exhibit the same behavior, which is further
verified by our real data-driven experiments in Section 5.
   Based on the above analysis, our heuristic is shown in Algorithm 2, where the whole
process is divided into two stages. In the first stage (lines 1-7), for each data center we
allocate enough battery groups so that the servers can support all the web jobs. In the
second stage (lines 8-17), we first allocate as many battery groups as possible to each
data center while the battery replacement cost decreases. Since the migration cost never
increases with more backup power, the overall cost also decreases in this stage. We then
continue allocating as long as the overall cost does not exceed the budget limit and the
average waiting time still decreases. To better balance the tradeoff between average
waiting time and cost, we define Gain as the ratio of the weighted average waiting time
decrease to the overall cost increase:

                              \text{Gain} = \frac{\Psi_i(n_i) - \Psi_i(n_i + 1)}{C_{i,all}(n_i + 1) - C_{i,all}(n_i)}                    (3.11)
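Eq. (3.11) is straightforward to evaluate; a minimal sketch follows, where `psi_i` and `cost_i` are placeholder callables mapping a group count to the waiting time and overall cost of data center i:

```python
def gain(psi_i, cost_i, n_i):
    """Gain of adding one battery group to data center i (Eq. (3.11)):
    waiting-time decrease divided by overall-cost increase."""
    return (psi_i(n_i) - psi_i(n_i + 1)) / (cost_i(n_i + 1) - cost_i(n_i))
```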

Algorithm 2: Battery Allocation
   Input : Results of data center and battery feature profiling.
   Output: The allocation result ni for every data center i and the average
             waiting time Ψ of batch jobs.
 1 foreach i in D do
 2    Set the initial battery assignment ni = nL ;
 3    while ni ≤ nU do
 4        Increase ni until Eq. (3.3) is satisfied;
 5        Calculate Cr and proceed to the next data center;
 6    end
 7 end
 8 while Call ≤ B and Ψ still decreases do
 9    Tentatively allocate one more battery group to each data center i and
        calculate Ψ, Cr and Call ;
10    if Cr begins to increase then
11        Choose the data center i with the maximum Gain and add one battery
           group to it;
12    end
13    else
14        Add one battery group to data center i;
15    end
16    Update Ψ and Call ;
17 end
18 return ni for every data center i, Ψ and Call ;

Each time we select the data center with the maximum Gain and add one battery group to
it, until we reach the budget. This greedy approach guarantees that each step achieves the
largest reduction in average waiting time for the least cost increase. We next analyze the
complexity of our heuristic algorithm to show its efficiency. In the first stage, we visit
each data center once, so the complexity is O(n). In the second stage, we calculate Ψ, Cr
and Call for each data center and iteratively select the data center with the maximum Gain,
which contributes a complexity of O(n log(n)). The total complexity of our algorithm is
thus O(n log(n)).
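The two-stage procedure above can be sketched in Python as follows. This is an illustrative rendering of Algorithm 2, not the thesis implementation: `supports_web`, `psi`, and `cost_all` are placeholder callables standing in for the profiling results, ties are broken by first index, and the stage-2 branch on C_r is folded into the gain rule.

```python
def allocate_batteries(dcs, n_lo, n_hi, budget, supports_web, psi, cost_all):
    """Greedy sketch of Algorithm 2 (Battery Allocation).

    dcs                -- list of data-center indices
    supports_web(i, n) -- True if n battery groups keep the web jobs of
                          data center i uninterrupted (Eq. (3.3))
    psi(alloc)         -- average waiting time under allocation {i: n_i}
    cost_all(alloc)    -- overall cost C_all = C_r + C_m (Eq. (3.5))
    """
    # Stage 1 (lines 1-7): in each data center, raise n_i from the lower
    # bound until the web jobs are fully covered.
    alloc = {}
    for i in dcs:
        n = n_lo
        while n < n_hi and not supports_web(i, n):
            n += 1
        alloc[i] = n

    # Stage 2 (lines 8-17): greedily add the battery group with the
    # largest Gain (Eq. (3.11)) while the budget holds and the average
    # waiting time still decreases.
    while cost_all(alloc) <= budget:
        best_i, best_gain = None, 0.0
        for i in dcs:
            if alloc[i] >= n_hi:
                continue
            trial = dict(alloc)
            trial[i] += 1
            d_psi = psi(alloc) - psi(trial)
            d_cost = cost_all(trial) - cost_all(alloc)
            if d_cost <= 0 and d_psi >= 0:
                g = float("inf")        # cost also drops: always worth taking
            elif d_psi > 0 and d_cost > 0:
                g = d_psi / d_cost
            else:
                continue                # waiting time no longer improves
            if g > best_gain:
                best_i, best_gain = i, g
        if best_i is None:
            break
        alloc[best_i] += 1
        if cost_all(alloc) > budget:    # roll back if the budget is broken
            alloc[best_i] -= 1
            break
    return alloc
```

When adding a group also lowers the overall cost (stage 1 of the replacement-cost curve), the sketch takes it unconditionally, which mirrors lines 13-15 of Algorithm 2.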

Chapter 4

Performance Evaluation

In this chapter, based on real-world data and simulations, we evaluate the performance of
the proposed Workload Migration algorithm and the Battery Allocation framework, respectively.

4.1     Evaluation of Workload Migration Algorithm
In this section, we will introduce our experiment setup and the performance evaluation of
our proposed Workload Migration algorithm.

4.1.1   Simulation Setup
We conduct extensive trace-driven simulations with diverse network, computing, and energy
configurations to evaluate the performance of our method, and compare it with several
baseline methods. The number of servers in a single distributed data center varies from 2 to
20. The average task arrival and server service rates are drawn from uniform distributions,
with the former varying from 10 to 15 MB/s and the latter from 15 to 20 MB/s. Here the

                              Table 4.1: Server configuration

         Item                    Quantity Configuration
         Processor               4        Intel    8158  3.0   GHz   150W
                                          12C/24.75MB          Cache/DDR4
                                          2666MHz
         Memory                  48       128GB      DDR4-2666-MHz    TSV-
                                          RDIMM/PC4-23100/octal
                                          rank/x4/1.2v
         Power Supply            4        1600W PSU1
         Dedicated Storage       1        RAID-Cisco 12G SAS Modular Con-
         Controller                       troller 4GB FBWC
         Storage                 16       480GB 6Gb SATA 2.5-inch SSD En-
                                          terprise Value
         Adapter                 12       GPU-NVIDIA V100 32GB
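The rate generation described above could be implemented as in this sketch (illustrative function and field names; a fixed seed is used for reproducibility):

```python
import random

def make_data_center(rng):
    """Draw one data center's simulation parameters:
    2-20 servers, arrival rate U(10, 15) MB/s, service rate U(15, 20) MB/s."""
    return {
        "servers": rng.randint(2, 20),
        "arrival_rate": rng.uniform(10.0, 15.0),   # MB/s
        "service_rate": rng.uniform(15.0, 20.0),   # MB/s
    }

rng = random.Random(42)   # fixed seed so runs are repeatable
dcs = [make_data_center(rng) for _ in range(5)]
```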
