The Design and Implementation of the KOALA Co-Allocating Grid Scheduler


                        H.H. Mohamed and D.H.J. Epema

       Faculty of Electrical Engineering, Mathematics, and Computer Science
                            Delft University of Technology
                  P.O. Box 5031, 2600 GA Delft, The Netherlands
                e-mail: H.H.Mohamed, D.H.J.Epema@ewi.tudelft.nl

      Abstract. In multicluster systems, and more generally, in grids, jobs
      may require co-allocation, i.e., the simultaneous allocation of resources
      such as processors and input files in multiple clusters. While such jobs
      may have reduced runtimes because they have access to more resources,
      waiting for processors in multiple clusters and for the input files to become
      available in the right locations may introduce inefficiencies. In this
      paper we present the design of KOALA, a prototype for processor and
      data co-allocation that tries to minimize these inefficiencies through the
      use of its Close-to-Files placement policy and its Incremental Claiming
      Policy. The latter policy tries to solve the problem of a lack of support
      for reservation by local resource managers.

1   Introduction
Grids offer the promise of transparent access to large collections of resources for
applications demanding many processors and access to huge data sets. In fact,
the needs of a single application may exceed the capacity available in each of
the subsystems making up a grid, and so co-allocation, i.e., the simultaneous
access to resources of possibly multiple types in multiple locations, managed by
different resource managers [1], may be required.
    Even though multiclusters and grids offer very large amounts of resources,
to date most applications submitted to such systems run in single subsystems
managed by a single scheduler. With this approach, grids are in fact used as big
load balancing devices, and the function of a grid scheduler amounts to choosing
a suitable subsystem for every application. The real challenge in resource man-
agement in grids lies in co-allocation. Indeed, the feasibility of running parallel
applications in multicluster systems by employing processor co-allocation has
been demonstrated [2, 3].
    In this paper, we present the design and implementation of a grid scheduler
named KOALA on our wide-area Distributed ASCI Supercomputer (DAS, see
Section 2.1). KOALA includes mechanisms and policies for both processor and
data co-allocation in multicluster systems, and more generally, in grids. KOALA
uses the Close-to-Files (CF) and the Worst Fit (WF) policy for placing job com-
ponents on clusters with enough idle processors. Our biggest problem in pro-
cessor co-allocation is the lack of a reservation mechanism in the local resource
managers. In order to solve this problem, we propose the Incremental Claiming
Policy (ICP), which optimistically postpones the claiming of the processors for
the job to a time close to the estimated job start time.
    In this paper we present the complete design of KOALA, which has been imple-
mented and tested extensively in the DAS testbed. A more extensive description
and evaluation of its placement policies can be found in [4]. The main contribu-
tions of this paper are a reliably working prototype for co-allocation and the ICP
policy, which is a workaround for the lack of processor reservation. The evaluation
of the performance of ICP will be the subject of a future paper.

2     A Model for Co-allocation
In this section, we present our model of co-allocation in multiclusters and in
grids.

2.1   System Model
Our system model is inspired by the DAS [5], which is a wide-area computer system
consisting of five clusters (one at each of five universities in the Netherlands,
amongst which Delft) of dual-processor Pentium-based nodes, one with 72, the
other four with 32 nodes each. The clusters are interconnected by the Dutch
university backbone (100 Mbit/s), while for local communications inside the
clusters Myrinet LANs are used (1200 Mbit/s). The system was designed for
research on parallel and distributed computing. On single DAS clusters, PBS
[6] is used as a local resource manager. Each DAS cluster has its own separate
file system, and therefore, in principle, files have to be moved explicitly between
users’ working spaces in different clusters.
     We assume a multicluster environment with sites that each contain compu-
tational resources (processors), a file server, and a local resource manager. The
sites may combine their resources to be managed by a grid scheduler when ex-
ecuting jobs in a grid. The sites where the components of a job run are called
its execution sites, and the site(s) where its input file(s) reside are its file sites.
In this paper, we assume a single central grid scheduler, and the site where it
runs is called the submission site. Of course, we are aware of the drawbacks of
a single central submission site and currently we are working on extending our
model to multiple submission sites.

2.2   Job Model
By a job we mean a parallel application requiring files and processors that can
be split up into several job components which can be scheduled to execute on
multiple execution sites simultaneously (co-allocation) [7, 8, 1, 4]. This allows the
execution of large parallel applications requiring more processors than available
on a single site [4]. Job requests are supposed to be unordered, meaning that
a job only specifies the numbers of processors needed by its components, but
not the sites where these components should run. It is the task of the grid
scheduler to determine in which cluster each job component should run, to move
the executables as well as the input files to those clusters before the job starts,
and to start the job components simultaneously.
    We consider two priority levels, high and low, for jobs. The priority levels
play a part only when a high-priority job is about to start executing. At this
time and at the execution sites of the job, it is possible for processors requested
by the job to be occupied by other jobs. Then, if not enough processors are
available, a job of high priority may preempt low-priority jobs until enough
processors have been freed for it to execute (see Section 4.2).
    We assume that the input of a whole job is a single data file. We deal with
two models of file distribution to the job components. In the first model, job
components work on different chunks of the same data file, which has been
partitioned as requested by the components. In the second model, the input
to each of the job components is the whole data file. The input data files have
unique logical names and are stored and possibly replicated at different sites. We
assume that there is a replica manager that maps the logical file names specified
by jobs onto their physical location(s).
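    To make this job model concrete, the following minimal Python sketch shows one
possible representation of such a job request; the class and field names are ours and
are not part of KOALA (a real request carries this information in a JDF, see Section 3.1):

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class JobComponent:
        processors: int                       # number of processors requested
        execution_site: Optional[str] = None  # chosen later by the placement policy
        file_site: Optional[str] = None       # replica chosen for this component

    @dataclass
    class Job:
        components: List[JobComponent]  # unordered: no sites are specified by the user
        input_file: str                 # logical file name, resolved by the replica manager
        priority: str = "low"           # "low" or "high"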

2.3   Processor Reservation

In order to achieve co-allocation for a job, we need to guarantee the simultaneous
availability of sufficient numbers of idle processors to be used by the job at
multiple sites. The most straightforward strategy to do so is to reserve processors
at each of the selected sites. If the local schedulers do support reservations,
this strategy can be implemented by having a global grid scheduler obtain a
list of available time slots from each local scheduler, and reserve a common
timeslot for all job components. Unfortunately, a reservation-based strategy in
grids is currently limited due to the fact that only few local resource managers
support reservations (for instance, PBS-pro [9] and Maui [10] do). In the absence
of processor reservation, good alternatives are required in order to achieve co-
allocation.
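    To illustrate what a reservation-based strategy would involve, the sketch below
intersects per-site lists of free time slots to find a common slot. It is purely
illustrative and is not how KOALA operates, since such slot lists are generally not
obtainable from the local resource managers:

    def earliest_common_slot(slot_lists, duration):
        # slot_lists: one list of (start, end) intervals per site.
        # Returns the earliest time t at which every site has an interval
        # covering [t, t + duration), or None if no such time exists.
        candidates = sorted(start for slots in slot_lists for (start, _) in slots)
        for t in candidates:
            if all(any(start <= t and t + duration <= end for (start, end) in slots)
                   for slots in slot_lists):
                return t
        return None

    # Two sites with free slots (in minutes from now); a 30-minute common slot starts at 60.
    print(earliest_common_slot([[(0, 30), (60, 180)], [(45, 120)]], duration=30))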

3     The Processor and Data Co-Allocator

We have developed a Processor and Data Co-Allocator (KOALA) prototype of
a co-allocation service in our DAS system (see Section 2.1). In this section we
describe the components of the KOALA, and how they work together to achieve
co-allocation.

3.1   The KOALA Components

The KOALA consists of the following four components: the Co-allocator (CO),
the Information Service (IS), the Job Dispatcher (JD), and the Data Mover
(DM). The components and interactions between them are illustrated in Figure
1, and are described below.
    The CO accepts a job request (arrow 1 in the figure) in the form of a Job
Description File (JDF). We use the Globus Resource Specification Language
(RSL) [11] for JDFs, with the RSL "+"-construct to aggregate the components’
requests into a single multi-request. The CO uses a placement policy (see Sec-
tion 4.1) to try to place jobs, based on information obtained from the IS (arrow
2). The IS is comprised of the Globus Toolkit’s Metacomputing Directory Ser-
vice (MDS) [11] and Replica Location Service (RLS) [11], and Iperf [12], a tool
to measure network bandwidth. The MDS provides the information about the
numbers of processors currently used and the RLS provides the mapping infor-
mation from the logical names of files to their physical locations. After a job
has been successfully placed, i.e., the file sites and the execution sites of the job
components have been determined, the CO forwards the job to the JD (arrow
3).
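    As an illustration, an RSL multi-request for a two-component job might look
roughly as follows; the contact strings, processor counts, and file names are
invented for the example:

    + ( & (resourceManagerContact = "cluster1.example.nl/jobmanager-pbs")
          (count = 16)
          (executable = "my_parallel_app")
          (arguments  = "chunk0.dat") )
      ( & (resourceManagerContact = "cluster2.example.nl/jobmanager-pbs")
          (count = 32)
          (executable = "my_parallel_app")
          (arguments  = "chunk1.dat") )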
    On receipt of the job, the JD instructs the DM (arrow 4) to initiate the third-
party file transfers from the file sites to the execution sites of the job components
(arrows 5). The DM uses Globus GridFTP [13] to move files to their destinations.
The JD then determines the Job Start Time and the time at which the
processors required by the job can be claimed (the Job Claiming Time, see Section 3.3).
At this time, the JD uses a claiming policy (see Section 4.2) to determine the
components that can be started based on the information from the IS (arrow 6.1).
The components which can be started are sent to the Local Schedulers (LSs)
of their respective execution sites through the Globus Resource Allocation Manager
(GRAM) [11].
    Synchronization of the start of the job components is achieved through a
piece of code added to the application which delays the execution of the job
components until the estimated Job Start Time (see Section 4).
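    The synchronization code itself is not shown in this paper; the sketch below only
illustrates the idea, with Python threads standing in for the distributed components
(the real components run on different clusters and would use a message-based barrier):

    import threading
    import time

    def run_component(job_start_time, barrier, work):
        # Delay until the estimated Job Start Time, ...
        delay = job_start_time - time.time()
        if delay > 0:
            time.sleep(delay)
        # ... then wait until all other components have also reached this point.
        barrier.wait()
        work()

    # Three components started at different moments all begin their work together.
    jst = time.time() + 2.0
    barrier = threading.Barrier(3)
    for i in range(3):
        threading.Thread(target=run_component,
                         args=(jst, barrier, lambda i=i: print("component", i, "started"))).start()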

[Figure 1 (block diagram): the CO, the IS (NWS, MDS, RLS), the JD, and the DM at the
submission site; GRAM and the local job schedulers (JS) at the execution sites; the file
servers FS 1, ..., FS n at the file sites. The numbered arrows (1 through 6.2) indicate
the interactions described in the text.]

Fig. 1. The interaction between the KOALA components. The arrows correspond to
the description in Section 3.
3.2   The Placement Queue
When a job is submitted to the system, the KOALA tries to place it according to
one of its placement policies (Section 4.1). If a placement try fails, the KOALA
adds the job to the tail of the so-called placement queue, which holds all jobs
that have not yet been successfully placed. The KOALA regularly scans the
placement queue from head to tail to see whether any job in it can be placed.
For each job in the queue we maintain its number of placement tries, and when
this number exceeds a threshold, the job submission fails. This threshold can
be set to ∞, i.e., no job placement fails. The time between successive scans of
the placement queue is adaptive; it is computed as the product of the average
number of placement tries of the jobs in the queue and a fixed interval (which
is a parameter of the KOALA). The time when job placement succeeds is called
its Job Placement Time (see Figure 2).
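    A minimal sketch of this bookkeeping, with the fixed interval and the retry
threshold as parameters; the names and data structures are ours, not KOALA's:

    def next_scan_interval(jobs, base_interval):
        # Adaptive time between scans: the average number of placement tries of the
        # queued jobs multiplied by a fixed, configurable interval.
        if not jobs:
            return base_interval
        return base_interval * sum(j.tries for j in jobs) / len(jobs)

    def scan_placement_queue(queue, try_place, max_tries):
        # Scan from head to tail; placed jobs leave the queue, and jobs that exceed
        # the threshold of placement tries fail (max_tries may be float('inf')).
        for job in list(queue):
            if try_place(job):          # apply CF or WF (Section 4.1)
                queue.remove(job)
            else:
                job.tries += 1
                if job.tries > max_tries:
                    queue.remove(job)   # the job submission fails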
[Figure 2 (timeline): A: Job Submission Time, B: Job Placement Time (JPT), C: Job
Claiming Time (JCT), D: Job Start Time (JST), E: Job Finish Time. The placement tries
fall between A and B; PGT (Processor Gained Time) spans B to C, PWT (Processor Wasted
Time) spans C to D, FTT (Estimated File Transfer Time) plus ACT (Additional Claiming
Tries) span B to D, TWT (Total Waiting Time) spans A to D, and the Job Run Time spans
D to E.]

                       Fig. 2. The timeline of a job submission.

3.3   The Claiming Queue
After the successful placement of a job, its File Transfer Time (FTT) and its
Job Start Time (JST) are estimated before the job is added to the so-called
claiming queue. This queue holds jobs which have been placed but for which the
allocated processors have not yet been claimed. The job’s FTT is calculated as
the maximum of all of its components’ estimated transfer times, and the JST is
estimated as the sum of its Job Placement Time (JPT) and its FTT (see Figure
2). We then set its Job Claiming Time (JCT) (point C in Figure 2) initially to
the sum of its JPT and the product of L and FTT:
                                   JCT_0 = JPT + L · FTT,
where L is a job-dependent parameter, with 0 < L < 1. In the claiming queue,
jobs are arranged in increasing order of their JCT.
    We try to claim (claiming try) at the current JCT by using our Incremental
Claiming Policy (see Section 4.2). The job is removed from the claiming queue
if claiming for all of its components has succeeded. Otherwise, we perform suc-
cessive claiming tries. For each such try we recalculate the JCT by adding to
the current JCT the product of L and the time remaining until the JST (time
between points C and D in Figure 2):
                          JCT_{n+1} = JCT_n + L · (JST − JCT_n).
If the job’s JCT_{n+1} reaches its JST and for some of its components claiming
has still not succeeded, the job is returned to the placement queue (see Section
3.2). Before doing so, its parameter L is decreased by a fixed amount and its
components that were successfully started in previous claiming tries are aborted.
The parameter L is decreased in each claiming try until its lower bound is reached
so as to increase the chance of claiming success. If the number of claiming tries
performed for the job exceeds some threshold (which can be set to
∞), the job submission fails.
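    In code, the claiming-time updates above amount to the following (times are, for
example, seconds since submission; this is a sketch, not KOALA's implementation):

    def initial_jct(jpt, ftt, L):
        # JCT_0 = JPT + L * FTT, with 0 < L < 1
        return jpt + L * ftt

    def next_jct(jct, jst, L):
        # JCT_{n+1} = JCT_n + L * (JST - JCT_n): each retry moves closer to the JST
        return jct + L * (jst - jct)

    # Example: placed at t = 100, estimated file transfer time 60, so JST = 160.
    L = 0.5
    jct = initial_jct(100, 60, L)   # 130.0
    jct = next_jct(jct, 160, L)     # 145.0; further tries give 152.5, 156.25, ...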
    We call the time between the JPT of a job and the time of successfully
claiming processors for it, the Processor Gained Time (PGT) of the job. The
time between the successful claiming and the actual job start time is Processor
Wasted Time (PWT) (see Figure 2). During the PGT, jobs submitted through
other schedulers than our grid scheduler can use the processors. The time from
the submission of the job until its actual start time is called the Total Waiting
Time (TWT) of the job.

4     Co-allocation Policies
In this section, we present the co-allocation policies for placing jobs and claiming
processors that are used with KOALA.

4.1    The Close-to-Files Placement Policy
Placing a job in a multicluster means finding a suitable set of execution sites
for all of its components and suitable file sites for the input file. (Different com-
ponents may get the input file from different locations.) The most important
consideration here is of course finding execution sites with enough processors.
However, when there is a choice among execution sites for a job component,
we choose the site such that the (estimated) delay of transferring the input file
to the execution site is minimal. We call the placement policy doing just this
the Close-to-Files (CF) policy. A more extensive description and performance
analysis of this policy can be found in [4].
     Built into the KOALA is also the Worst Fit (WF) placement policy. WF
places the job components in decreasing order of their sizes on the execution
sites with the largest (remaining) number of idle processors. In case the files are
replicated, we select for each component the replica with the minimum estimated
file transfer time to that component’s execution site.
     Note that both CF and WF may place multiple job components on the same
cluster. We also remark that both CF and WF make perfect sense in the absence
of co-allocation.
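     As an illustration, the Worst Fit variant can be sketched as follows (CF differs
in that, among feasible sites, it prefers the one with the smallest estimated file
transfer delay); the data structures are ours:

    def worst_fit(component_sizes, idle_processors):
        # Place components (processor counts) in decreasing order of size on the
        # execution sites with the largest remaining numbers of idle processors.
        # idle_processors: site name -> idle processors; returns component index -> site.
        idle = dict(idle_processors)            # work on a copy
        placement = {}
        for i, size in sorted(enumerate(component_sizes), key=lambda p: -p[1]):
            site = max(idle, key=idle.get)      # site with the most idle processors left
            if idle[site] < size:
                return None                     # this placement try fails
            placement[i] = site
            idle[site] -= size
        return placement

    print(worst_fit([16, 8, 8], {"site_a": 20, "site_b": 24, "site_c": 10}))
    # {0: 'site_b', 1: 'site_a', 2: 'site_a'}: components may share a cluster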

4.2    The Incremental Claiming Policy
Claiming processors for job components starts at a job’s initial JCT and is re-
peated at subsequent claiming tries. Claiming for a component will only succeed
if there are still enough idle processors to run it. Since we want the job to start
with minimal delay, the component may, in the process of the claiming policy,
be re-placed using our placement policy. A re-placement can be accepted if the
file from the new file site can be transferred to the new execution site before the
job’s JST. We can further minimize the delay of starting high priority jobs by
allowing them to preempt low priority jobs at their execution sites.
     We call the policy doing all of this the Incremental Claiming Policy (ICP),
which operates as follows (the line numbers mentioned below refer to Algorithm
1). For a job, ICP first determines the sets Cprev, Cnow, and Cnot of components
that have been previously started, of components that can be started now based
on the current number of idle processors, and of components that cannot be
started based on these numbers, respectively. It further calculates F, which is
the sum of the fractions of the job components that have previously been started
and components that can be started in the current claiming try (line 1). We
define T as the required lower bound of F; the job is returned to the claiming
queue if its F is lower than T (line 2).

Algorithm 1 Pseudo-code of the Incremental Claiming Policy
Require: job J is already placed
Require: set Cprev of previously started components of J
Require: set Cnow of components of J that can be started now
Require: set Cnot of components of J that cannot be started now
 1: F ⇐ (|Cprev| + |Cnow|)/|J|
 2: if F ≥ T then
 3:    if Cnot ≠ ∅ then
 4:       for all k ∈ Cnot do
 5:          (Ek, Fk, fttk) ⇐ Place(k)
 6:          if fttk + JCT < JST then
 7:             move k from Cnot to Cnow
 8:          else if the priority of J is high then
 9:             Pk ⇐ number of processors used by low-priority jobs at Ek
10:             if Pk ≥ size of k then
11:                repeat
12:                   preempt low-priority jobs at Ek
13:                until number of processors freed at Ek ≥ size of k
14:                move k from Cnot to Cnow
15:    start components in Cnow

    Otherwise, for each component k that cannot be started, ICP first tries to
find a new execution site-file site pair with the CF policy (line 5). On success,
the new execution site Ek, the file site Fk, and the new estimated transfer time
between them, fttk, are returned. If the file can be transferred between these
sites before JST (line 6), the component k is moved from the set Cnot to the set
Cnow (line 7).
    For a job of high priority, if the file cannot be transferred before the JST or the
re-placement of the component has failed (line 8), the policy performs the following. At
the execution site Ek of component k, it checks whether the sum of its number of
idle processors and the number of processors currently being used by low-priority
jobs is at least equal to the number of processors the component requests (lines 9
and 10). If so, the policy preempts low-priority jobs in descending order of their
JST (newest-job-first) until a sufficient number of processors have been freed
(lines 11-13). The preempted jobs are then returned to the placement queue.
    Finally, those components that can be claimed at this claiming try are started
(line 15). For this purpose, a small piece of code has been added to the application
that delays the execution of the job components at the barrier until the job start time.
Synchronization is achieved by making each component wait on the barrier until
it hears from all the other components.
    When T is set to 1 the claiming process becomes atomic, i.e., claiming only
succeeds if for all the job components processors can be claimed.
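    For completeness, a condensed Python rendering of the control flow of Algorithm 1;
re-placement and preemption are abstracted into callbacks, and the attribute names are
ours rather than KOALA's:

    def incremental_claiming_try(job, idle, replace, preempt, jct, T):
        # idle(site): current number of idle processors at a site
        # replace(k): CF re-placement of component k -> (site, file_site, ftt) or None
        # preempt(site, n): try to free >= n processors by preempting low-priority jobs
        prev = [k for k in job.components if k.started]
        now = [k for k in job.components
               if not k.started and idle(k.execution_site) >= k.processors]
        not_yet = [k for k in job.components
                   if not k.started and idle(k.execution_site) < k.processors]

        F = (len(prev) + len(now)) / len(job.components)
        if F < T:
            return []                                 # job goes back to the claiming queue
        for k in not_yet:
            placed = replace(k)                       # lines 5-7 of Algorithm 1
            if placed and placed[2] + jct < job.jst:
                k.execution_site, k.file_site, _ = placed
                now.append(k)
            elif job.priority == "high" and preempt(k.execution_site, k.processors):
                now.append(k)                         # lines 8-14 of Algorithm 1
        return now                                    # components to start at this try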

4.3   Experiences

We have gathered extensive experience with the KOALA while performing hun-
dreds of experiments to assess the performance of the CF placement policy in [4].
In each of these experiments, more than 500 jobs were successfully submitted
to the KOALA. These experiments proved the reliability of the KOALA. An
attempt was also made to submit jobs to the GridLab [14] testbed, which is a
heterogeneous grid environment. KOALA also managed to submit jobs success-
fully to this testbed, and more experiments on this testbed are currently planned.

5     Related Work

In [15, 16], co-allocation (there called multi-site computing) is also studied with
simulations, with the (average weighted) response time as the performance metric.
There, jobs only specify a total number of processors, and are split up across
the clusters. The slow wide-area communication is accounted for by a factor r
by which the total execution times are multiplied. Co-allocation is compared to
keeping jobs local and to only sharing load among the clusters, assuming that
all jobs fit in a single cluster. One of the most important findings is that for r
less than or equal to 1.25, it pays to use co-allocation. In [17] an architecture
for a grid superscheduler is proposed, and three job migration algorithms are
simulated. However, there is no real implementation of this scheduler, and jobs
are confined to run within a single subsystem of a grid, reducing the problem
studied to a traditional load-balancing problem.
    In [18], the Condor class-ad matchmaking mechanism for matching single
jobs with single machines is extended to "gangmatching" for co-allocation. The
running example in [18] is the inclusion of a software license in a match of a
job and a machine, but it seems that the gangmatching mechanism might be
extended to the co-allocation of processors and data by binding execution and
storage sites into I/O communities that reflect the physical reality.
In [19], the scheduling of sequential jobs that need a single input file is studied
in grid environments with simulations of synthetic workloads. Every site has a
Local Scheduler, an External Scheduler (ES) that determines where to send
locally submitted jobs, and a Data Scheduler (DS) that asynchronously, i.e.,
independently of the jobs being scheduled, replicates the most popular files stored
locally. All combinations of four ES and three DS algorithms are studied, and
it turns out that sending jobs to the sites where their input files are already
present, and actively replicating popular files, performs best.
    In [20], the creation of abstract workflows consisting of application compo-
nents, their translation into concrete workflows, and the mapping of the latter
onto grid resources is considered. These operations have been implemented using
the Pegasus [21] planning tool and the Chimera [22] data definition tool. The
workflows are represented by DAGs, which are actually assigned to resources
using the Condor DAGMan and Condor-G [23]. As DAGs are involved, no si-
multaneous resource possession implemented by a co-allocation mechanism is
needed.
    In the AppLeS project [24], each grid application is scheduled according to its
own performance model. The general strategy of AppLeS is to take into account
resource performance estimates to generate a plan for assigning file transfers to
network links and tasks (sequential jobs) to hosts.

6    Conclusions

We have addressed the problem of scheduling jobs consisting of multiple compo-
nents that require both processor and data co-allocation in multicluster systems
and grids in general. We have developed KOALA, a prototype for processor and
data co-allocation which implements our placement and claiming policies. Our
initial experiences show the correct and reliable operation of the KOALA.
    As future work, we are planning to remove the bottleneck of a single global
scheduler, and to allow flexible jobs that only specify the total number of proces-
sors needed and allow the KOALA to fragment jobs into components (the way
of dividing the input files across the job components is then not obvious). In
addition, a more extensive performance study of the KOALA in a heterogeneous
grid environment is planned.

References
 1. Czajkowski, K., Foster, I.T., Kesselman, C.: Resource Co-Allocation in Compu-
    tational Grids. In: Proc. of the Eighth IEEE International Symposium on High
    Performance Distributed Computing (HPDC-8). (1999) 219–228
 2. van Nieuwpoort, R., Maassen, J., Bal, H., Kielmann, T., Veldema, R.: Wide-Area
    Parallel Programming Using the Remote Method Invocation Model. Concur-
    rency: Practice and Experience 12 (2000) 643–666
 3. Banen, S., Bucur, A., Epema, D.: A Measurement-Based Simulation Study of
    Processor Co-Allocation in Multicluster Systems. In Feitelson, D., Rudolph, L.,
Schwiegelshohn, U., eds.: 9th Workshop on Job Scheduling Strategies for Parallel
      Processing. Volume 2862 of LNCS. Springer-Verlag (2003) 105–128
 4.   Mohamed, H., Epema, D.: An Evaluation of the Close-to-Files Processor and
      Data Co-Allocation Policy in Multiclusters. In: Proc. of CLUSTER 2004, IEEE
      Int’l Conference Cluster Computing 2004. (2004)
 5.   Web-site: The Distributed ASCI Supercomputer (DAS), http://www.cs.vu.nl/das2.
 6.   Web-site: The Portable Batch System, http://www.openpbs.org.
 7.   Bucur, A., Epema, D.: Local versus Global Queues with Processor Co-Allocation
      in Multicluster Systems. In Feitelson, D., Rudolph, L., Schwiegelshohn, U., eds.:
      8th Workshop on Job Scheduling Strategies for Parallel Processing. Volume 2537
      of LNCS. Springer-Verlag (2002) 184–204
 8.   Anand, S., Yoginath, S., von Laszewski, G., Alunkal, B.: Flow-based Multistage
      Co-allocation Service. In d’Auriol, B.J., ed.: Proc. of the International Conference
      on Communications in Computing, Las Vegas, CSREA Press (2003) 24–30
 9.   Web-site: PBS Pro, http://www.pbspro.com/.
10.   Web-site: The Maui Scheduler, http://supercluster.org/maui/.
11.   Web-site: The Globus Toolkit, http://www.globus.org/.
12.   Web-site: Iperf version 1.7.0, http://dast.nlanr.net/Projects/Iperf/.
13.   Allcock, W., Bresnahan, J., Foster, I., Liming, L., Link, J., Plaszczac, P.: GridFTP
      Update. Technical report (2002)
14.   Web-site: GridLab: A Grid Application Toolkit and Testbed, http://www.gridlab.org/.
15.   Ernemann, C., Hamscher, V., Schwiegelshohn, U., Yahyapour, R., Streit, A.: On
      Advantages of Grid Computing for Parallel Job Scheduling. In: 2nd IEEE/ACM
      Int’l Symposium on Cluster Computing and the GRID (CCGrid2002). (2002) 39–
      46
16.   Ernemann, C., Hamscher, V., Streit, A., Yahyapour, R.: Enhanced Algorithms for
      Multi-Site Scheduling. In: 3rd Int’l Workshop on Grid Computing. (2002) 219–231
17.   Shan, H., Oliker, L., Biswas, R.: Job superscheduler architecture and performance
      in computational grid environments. In: Supercomputing ’03. (2003)
18.   Raman, R., Livny, M., Solomon, M.: Policy driven heterogeneous resource co-
      allocation with gangmatching. In: 12th IEEE Int’l Symp. on High Performance
      Distributed Computing (HPDC-12). IEEE Computer Society Press (2003) 80–89
19.   Ranganathan, K., Foster, I.: Decoupling Computation and Data Scheduling in
      Distributed Data-Intensive Applications. In: 11th IEEE International Symposium
      on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland.
      (2002)
20.   Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K.: Mapping
      Abstract Complex Workflows onto Grid Environments. J. of Grid Computing 1
      (2003) 25–39
21.   Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Koranda, S.,
      Lazzarini, A., Papa, M.A.: From Metadata to Execution on the Grid Pegasus and
      the Pulsar Search. Technical report (2003)
22.   Foster, I., Vockler, J., Wilde, M., Zhao, Y.: Chimera: A Virtual Data System for
      Representing, Querying, and Automating Data Derivation. In: 14th Int’l Conf. on
      Scientific and Statistical Database Management (SSDBM 2002). (2002)
23.   Frey, J., Tannenbaum, T., Foster, I., Livny, M., Tuecke, S.: Condor-G: A Com-
      putation Management Agent for Multi-Institutional Grids. In: Proceedings of the
      Tenth IEEE Symposium on High Performance Distributed Computing (HPDC),
      San Francisco, California (2001) 7–9
24.   Casanova, H., Obertelli, G., Berman, F., Wolski, R.: The AppLeS Parameter Sweep
      Template: User-Level Middleware for the Grid. (2000) 75–76