Runtime Locality Optimizations of Distributed Java Applications

16th Euromicro Conference on Parallel, Distributed and Network-Based Processing


Christian Hütter, Thomas Moschny
University of Karlsruhe
{huetter, moschny}@ipd.uni-karlsruhe.de

Abstract

In distributed Java environments, locality of objects and threads is crucial for the performance of parallel applications. We introduce dynamic locality optimizations in the context of JavaParty, a programming and runtime environment for parallel Java applications. Until now, an optimal distribution of the individual objects of an application had to be found manually, which has several drawbacks.

Based on a former static approach, we develop a dynamic methodology for automatic locality optimization. By measuring processing and communication times of remote method calls at runtime, a placement strategy can be computed that maps each object of the distributed system to its optimal virtual machine. Objects are then migrated between the processing nodes in order to realize this placement strategy. We evaluate our approach by comparing the performance of two benchmark applications with manually distributed versions. We show that our approach is particularly suitable for dynamic applications in which the optimal object distribution varies at runtime.

1. Introduction

Java enables developers to express concurrency and to create parallel applications by means of threads. Performance gains over a sequential solution can only be expected if the virtual machine is executed on a system with several processors. JavaParty [10] extends Java by a distributed runtime environment that consists of several Java virtual machines. The virtual machines are executed on the nodes of a cluster of workstations. Each virtual machine has its own address space, but can perform remote method invocations on other virtual machines. Thus, JavaParty allows for performance gains through parallelism in a distributed environment.

Merely distributing objects and threads over virtual machines is not sufficient for achieving performance gains. Since the placement of an object determines the processor on which its methods run, only methods of objects that reside on different machines can actually be executed in parallel. We therefore face two conflicting goals: on the one hand, groups of objects with frequent and expensive communication should be placed on the same node; on the other hand, objects should be distributed over the available processors to enable parallelism.

So far, JavaParty has provided a mechanism to create remote objects on specific nodes of a cluster environment. The developer is responsible for distributing the individual objects and thus for distributing the activities to the processing nodes. Such a manual approach has several disadvantages. First, the object distribution depends on the specific topology for which the program is compiled; the distribution strategy must be adapted to each target platform. Second, manually specifying the location of every single object creation is tedious. Third, for dynamic applications in which the optimal location of objects changes at runtime, the optimal placement often cannot be determined statically.

The work at hand focuses on the automatic generation of a distribution strategy for remote objects. The generation is based on runtime information about the distributed system. Thus, the programmer does not have to worry about a proper object distribution and can focus on the solution of the problem. Even if the initial object distribution generated by JavaParty is not optimal, the locality of the application is optimized at runtime.

In Section 2 we give a brief overview of JavaParty. Section 3 discusses related work in the field of distributed Java applications. In Section 4 we describe the design of our approach and explain some basic concepts that are necessary for further understanding.

0-7695-3089-3/08 $25.00 © 2008 IEEE
DOI 10.1109/PDP.2008.76
Section 5 presents the implementation and discusses the problems we encountered. In Section 6 we evaluate the effectiveness and efficiency of our work using two benchmark applications. Finally, Section 7 concludes this paper.

2. JavaParty

JavaParty extends Java by a pre-processor and a runtime environment for distributed parallel programming in workstation clusters. It transparently adds remote objects to Java whose methods can be invoked from remote virtual machines. Programmers can use the keyword remote to indicate that a class should be remotely accessible. Instances of remote classes are called remote objects, regardless of which virtual machine they reside on. The runtime system offers a mechanism to migrate remote objects between machines.

Java Remote Method Invocation (RMI) [14] permits the creation of classes whose instances can be accessed remotely from other JVMs. JavaParty uses RMI as its target and thus inherits some of its advantages, e.g. distributed garbage collection. It uses a special pre-processor to generate pure Java source code that is consistent with the RMI requirements. This approach hides the increased program complexity due to RMI constraints as well as the additional code for the creation and access of remote objects.

JavaParty code is transformed into regular Java code plus RMI hooks. The resulting RMI portions are fed into the RMI compiler to generate stubs and skeletons. Since existing code might be using the original classes, handle objects are introduced that hide the RMI classes from the user. This approach maintains the Java object semantics, so that the programmer can use remote objects just like normal Java objects.

3. Related work

This section gives an overview of existing systems for distributed execution of Java applications. The goal of these systems is to gain increased computational power while preserving Java's parallel programming paradigm. In [3], distributed runtime systems are categorized into cluster-aware VMs, compiler-based DSM systems, and systems using standard JVMs.

The first category consists of systems that use a non-standard JVM on each node to execute distributed applications. The most important examples of such systems are cJVM [2] and JESSICA2 [16]. Both approaches provide a complete single system image of a standard JVM. The advantage of using non-standard JVMs is increased efficiency due to the ability to access machine resources directly rather than through the JVM. A weakness of such systems is their lack of cross-platform compatibility.

cJVM aims at virtualizing a cluster and at obtaining high performance for regular Java applications. A number of optimization techniques are used to address caching, locality of execution, and object placement. The smart proxy mechanism of cJVM can be used as a framework to implement different locality protocols. Currently, cJVM is unable to use a standard JIT compiler and does not implement a custom one.

JESSICA2 applies transparent Java thread migration to multi-threaded Java applications. The migration mechanism allows distributing threads among cluster nodes at runtime. To support shared object access, a global object space has been implemented. The system includes some important features, e.g. load balancing through thread migration, an adaptive home-migration protocol, and a custom JIT compiler.

Other systems compile the source or class files of a Java application into native machine code. Both Hyperion [1] and Jackal [15] support standard Java and do not change its programming paradigm. The use of a custom source or byte code compiler has the disadvantage that such a compiler must continually be adapted to changes in the Java language specification. The advantage of compiler-based systems is their increased performance because of compiler optimizations and direct access to system resources.

Hyperion offers an infrastructure for heterogeneous clusters providing the illusion of a single JVM. The original Java threads are mapped onto native system threads which are spread across the processing nodes to provide load balancing. The Java memory model is implemented by a DSM protocol, so the original semantics of the Java language is kept unchanged. To achieve portability, the Hyperion platform has been built on top of a portable runtime environment which supports various networks and communication interfaces.

Jackal is a DSM system for Java which consists of an optimizing compiler and a runtime system. In combination with compiler optimizations, Jackal applies various runtime optimizations to increase locality and manage large data structures. The runtime system includes a distributed garbage collector and provides thread and object location transparency.

While most systems use standard JVMs, only a few of them preserve the standard Java programming paradigm. Examples of such systems are JavaSymphony [4] and ADAJ [5]. Using standard

JVMs has the advantage that such systems can use heterogeneous nodes which locally optimize their performance using a JIT compiler. The main disadvantage of such systems is their relatively slow access to system resources.

JavaSymphony is a programming environment for distributed and parallel computing that exploits heterogeneous resources. In order to use JavaSymphony efficiently, the programmer has to explicitly control data locality and load balancing. The structure of the computing resources has to be defined manually. Since all objects must be created, mapped, and freed explicitly, the handling of remote objects can be quite cumbersome. JavaSymphony does not offer assistance for those manual steps, so the semi-automatic distribution is likely to be error-prone.

ADAJ is an environment for the development and execution of distributed Java applications. ADAJ is designed on top of JavaParty and is therefore most closely related to our work. The ADAJ project deals with placement and migration of Java objects. It automatically deploys parallel Java applications on a cluster of workstations by monitoring the application behavior. ADAJ contains a load-balancing mechanism that considers changes in the evolution of the application. While the focus of ADAJ is to balance the load between the individual JVMs, we concentrate on optimizing the locality of the distributed application.

4. Design

4.1. Locality optimizations

Philippsen and Haumacher proposed locality optimizations in JavaParty by means of static type analysis [11]. They classify approaches to dealing with locality in parallel object-oriented languages into three categories: (i) let the programmer specify placement and migration explicitly by means of annotations, (ii) static object distribution, where the compiler tries to predict the best node for a new object, and (iii) dynamic object distribution, based on a runtime system that keeps track of the call graph. JavaParty already provides mechanisms for manual object placement and migration, so we focus on static and dynamic object distribution in the following.

4.1.1. Static object distribution. Although a Java thread cannot migrate, the control flow (called activity in the following) can: when a method of a remote object is invoked, the activity conceptually leaves the JVM of the caller and is continued at the callee's JVM, where it competes with other activities. Due to time-slicing and blocking, competing activities on one JVM decrease the total parallelism. Additional costs are introduced by the remote method invocation itself because of communication latency and bandwidth limitations. Thus, the general distribution strategy must be activity-centered: different activities should be placed onto different JVMs, and objects should be co-located with activities such that method invocation is local. Local method invocation avoids both network communication and competing activities.

Haumacher proposes an iterative procedure [6] to assign objects to activities and then activities to virtual machines. Based on a static type analysis, estimates for two values are derived: work(t, a) describes the computing time that activity t spends on methods of object a, and cost(t, a) describes the communication time that would be necessary if t and a were not located in the same address space. Object a should be placed in the address space of the activity t whose computing time is maximized by that placement. At the same time, the sum of the communication costs incurred by those activities ti assigned to remote virtual machines should be minimized.

We assume an initial setting where all objects are located in a single address space with a single processor, such that all method calls are local. In order to distribute objects to activities, we suppose that each activity runs in a different address space with its own processor. By placing object a in the address space of activity t, method calls of a by t can be executed in parallel to other activities. Thus, work(t, a) indicates the time that is gained by placing a within the address space of t. The communication costs that other activities ti spend to access methods of a break even if work(t, a) is greater than the sum of cost(ti, a). So each object a can be mapped to the activity t in whose address space it should be placed:

    activity(a) = t  ⇔  t maximizes  work(t, a) − Σ_{ti ≠ t} cost(ti, a)

Since usually more activities are used than virtual machines are available, several activities must share a virtual machine. Thus, it is necessary to identify groups of activities that should be executed on a shared virtual machine. The parallelization win of each activity can be estimated by mapping each object to its optimal activity. The parallelization win is the sum of work(t, a) over objects a which reside in the address space of activity t, minus the sum of cost(t, b) over objects b that are placed remotely:

    win(t) = Σ_{a | activity(a) = t} work(t, a) − Σ_{b | activity(b) ≠ t} cost(t, b)
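As a concrete illustration, the two quantities defined above can be sketched in plain Java. This is a minimal sketch under the assumption that the work and cost values have already been obtained and are stored in nested maps keyed by activity and object identifiers; the class and method names are hypothetical and not part of the JavaParty API.

```java
import java.util.Map;
import java.util.Set;

// Sketch of the placement rule and parallelization win (hypothetical names,
// not JavaParty API). Activities and objects are identified by plain strings;
// the measured work/cost values are assumed to be given as nested maps.
public class PlacementSketch {

    // activity(a): the activity t maximizing work(t,a) - sum_{ti != t} cost(ti,a)
    static String activity(String a,
                           Map<String, Map<String, Long>> work,
                           Map<String, Map<String, Long>> cost) {
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (String t : work.keySet()) {
            long score = work.get(t).getOrDefault(a, 0L);
            for (String ti : cost.keySet())
                if (!ti.equals(t))                      // charge the communication of all other activities
                    score -= cost.get(ti).getOrDefault(a, 0L);
            if (score > bestScore) { bestScore = score; best = t; }
        }
        return best;
    }

    // win(t): sum of work(t,a) over objects mapped to t,
    // minus cost(t,b) for objects mapped to other activities
    static long win(String t, Set<String> objects,
                    Map<String, Map<String, Long>> work,
                    Map<String, Map<String, Long>> cost) {
        long w = 0;
        for (String a : objects) {
            if (t.equals(activity(a, work, cost)))
                w += work.get(t).getOrDefault(a, 0L);   // local work runs in parallel
            else
                w -= cost.get(t).getOrDefault(a, 0L);   // remote access costs communication
        }
        return w;
    }
}
```

In this sketch an object goes to the activity whose local work on it outweighs the communication all other activities would spend on it, exactly mirroring the formula above.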

The sum of work(t, a) represents the computing time that activity t spends in its own address space. This work is done in parallel to other activities if no synchronization mechanisms are used. The time that is spent on communication with other address spaces is represented by the sum of cost(t, b) for all objects b that are not assigned to activity t. Note that we charge the cost of a remote call to the activity that invoked the remote method, not to the activity that actually executes the method call.

Activities are assigned to the available virtual machines in decreasing order of their parallelization wins until a single activity has been scheduled to each virtual machine. For each remaining activity, a new parallelization win is computed that accounts for the potential co-location with other activities. The activity is assigned to the group of activities with the highest combined parallelization win. This process is repeated until all activities are scheduled to their optimal virtual machine.

The result of the distribution analysis is a mapping of each remote object to the virtual machine on which it should be placed.

4.1.2. Dynamic object distribution. While Philippsen and Haumacher focus on static object distribution through type analysis, we rely on dynamic object distribution to improve locality. This approach is reported to have two disadvantages: first, there is no knowledge about future call graphs or invocation frequencies; second, the creation of objects that cannot migrate often results in a broad redistribution of other objects. The first problem is inherent to dynamic approaches, but can be softened by using heuristics to predict future behavior. The second problem is not much of an issue in homogeneous cluster environments and can be handled by avoiding cyclic redistributions of remote objects.

Besides these problems, the dynamic approach has the essential advantage that instead of estimating the values of work and cost, they can be measured: we take work as the actual execution time of a method call and cost as the communication time of a remote method invocation. As detailed later, we have to estimate the cost of remote calls that are actually executed locally because the called object resides on the same node. We adapt Haumacher's approach and use an iterative procedure to distribute objects to activities and then assign activities to virtual machines. Objects are migrated to the virtual machine their optimal activity is assigned to.

4.2. Time measurements

Having developed a placement methodology for remote objects, we now focus on how to measure the time values required by the distribution algorithm. Beginning with the Pentium processor, Intel allows the programmer to access a time-stamp counter [8]. This counter keeps an accurate count of every cycle that occurs on the processor: starting at zero, it is incremented every clock cycle. To access the counter, programmers can use the RDTSC (read time-stamp counter) instruction. We use the counter to obtain a time estimate for the duration of method invocations.

Note that the time-stamp counter measures cycles, not time. Thus, comparing cycle counts only makes sense on processors of the same speed, as in a homogeneous cluster environment. To compare processors of different speeds, the cycle counts should be converted into time units. While the unit of time returned by System.currentTimeMillis() is a millisecond, the granularity of the value depends on the underlying OS and may be larger. Thus, the time-stamp counter also allows much finer measurements.

To avoid measurement errors due to concurrency, we assume that the workstations of the cluster are used exclusively for JavaParty. In the presence of background jobs, cycle counting does not always reflect the real execution time of an application. But in the long run, the interrupts caused by background jobs are approximately the same on all workstations of a homogeneous cluster. Thus, we assume that those interrupts balance out over time such that cycle counting actually reflects the average execution time.

4.3. Remote Method Invocation

RMI uses a standard mechanism for communicating with remote objects: stubs and skeletons. A stub for a remote object acts as a local representative or proxy for the remote object. The stub hides the serialization of parameters and the network communication, whereas the skeleton is responsible for dispatching the call to the actual remote object implementation. We want to measure work(t, a) and cost(t, a) in order to apply the distribution algorithm. In the context of stubs and skeletons, work corresponds to the time that the actual method implementation takes, and cost corresponds to the time that is required for carrying out the remote call, i.e. marshaling and transmitting parameters and result.

For a remote object r, a stub is instantiated on each node, while only one skeleton is instantiated on the node where the implementation of r resides. That is,

That is, there are n stubs and one skeleton for each remote object. Basically, our approach is to measure the communication time of a remote call in the stub and the execution time of the implementation in the skeleton by using the RDTSC instruction. We store aggregated work and cost values in the skeleton.

5. Implementation

5.1. Time measurements

Our framework for performance measuring wraps the RDTSC instruction described in the previous section using the Java Native Interface [13]. As detailed in Table 1, accessing the system time is orders of magnitude more expensive than using the RDTSC instruction. Times were measured on a Pentium III 800 MHz system.

Table 1. Cost of System.currentTimeMillis()
Call                           Cycles    Time
RDTSC.readccounter()              613    0.77 µs
System.currentTimeMillis()      36941   46.18 µs

5.2. KaRMI

KaRMI [12] is a fast replacement for Java RMI. It is based on an efficient object serialization mechanism that replaces regular Java serialization. Since the remote method invocation protocol differs from Java RMI, the format of stubs and skeletons differs, too. The KaRMI compiler generates stub and skeleton classes from compiled remote classes. We modified the generation of stubs and skeletons to include code that measures the execution times of remote calls. The measured times are processed by the distribution task to compute an optimal object distribution.

More precisely, we modified the generation of stubs to measure the total execution time of remote calls. Once a remote call returns, the stub sends the total time to the skeleton, which measured the execution time of the actual implementation (i.e., work). Using both values, we compute cost as the difference between total time and work.

In order to transmit the total time from stub to skeleton, we added methods for sending and receiving the measured times to the client and server side of the connection. These methods are called after a remote method invocation has completed and the result has been marshaled back to the caller. Finally, the work and cost values are stored in the skeleton using a special data structure described later.

5.3. Estimation of cost

An important optimization carried out by JavaParty is that a call is only executed remotely if the called object actually resides on another node. Otherwise, the call is executed locally. Recall that cost(t, a) estimates the communication time that would be necessary if activity t were not located on the same node as object a. While we are able to measure the actual communication time of remote calls, we have to estimate the cost of local calls as if they were remote. Thus, we have to develop a model that estimates the communication cost based on the measured cost of a local call.

Whenever the client and server objects are in the same address space, arguments and result are cloned to preserve the copy semantics of a remote call. JavaParty produces a deep clone, with all referenced objects also being cloned. In the generated stubs, the instrumented version of the local shortcut measures the cost of cloning arguments and return value.

The measuring can be divided into three parts: cloning of the arguments, local method invocation, and cloning of the result. Based on the measured local cost of cloning arguments and result, we estimate the communication cost the call would incur if it were remote. For this purpose, we analyzed the results of a benchmark suite that measures the execution times of local and remote method calls for a representative set of parameter types.

Given the duration of a local call, we estimate how long a remote call would take. While the absolute values are likely to vary on different machines, the relation between local and remote calls should be approximately the same. For simplicity, we assume a linear model with offset a and gradient b:

    remote cost = a + b · (local cost)

We applied a nonlinear least-squares algorithm to the results of the benchmark suite in order to fit the estimate function and determine the values of a and b.

5.4. Smoothing and storing time values

We use a hash map to store time values, mapping activities to work and cost values. JavaParty assigns a globally unique thread id to activities that perform remote calls. If a new measurement is to be stored, the given thread id is mapped to a pair of work and cost values. We store these values directly with the skeleton, so the addressed object is implicit.
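To make the fitting step of the straight-line model above concrete: for a linear model, an ordinary least-squares fit already determines a and b (the authors used a general nonlinear least-squares solver; the data points below are invented for illustration):

```java
// Ordinary least-squares fit of remote = a + b * local.
// The sample points are invented and lie exactly on remote = 100 + 2 * local.
public class CostModelFit {
    public static void main(String[] args) {
        double[] local  = {10, 20, 40, 80};     // measured local cost (invented)
        double[] remote = {120, 140, 180, 260}; // measured remote cost (invented)
        int n = local.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx  += local[i];
            sy  += remote[i];
            sxx += local[i] * local[i];
            sxy += local[i] * remote[i];
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx); // gradient
        double a = (sy - b * sx) / n;                         // offset
        System.out.println("a=" + a + " b=" + b);             // -> a=100.0 b=2.0
    }
}
```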
Since work and cost indicate the computing and communication times an activity spends on all methods of an object, we have to aggregate the values of the individual methods in a reasonable way.

We use an exponential moving average, which has the following advantages over simply adding up the time values: First, the weighting of each data point decreases exponentially, giving more importance to recent observations while still not discarding older ones. Second, the weighting makes our measurement more robust against outliers, e.g., delayed execution caused by distributed garbage collection. Third, the exponential moving average is easy to compute and thus a relatively cheap operation.

5.6. Application monitoring

JavaParty offers an interface that allows plugging in additional classes that can be used for monitoring the distributed environment. In our case, the monitor interface is implemented as an invisible task that collects runtime data based on instrumentation. This data is used to analyze the distribution of remote objects over the virtual machines.

In JavaParty, references to remote objects are stored in a distributed fashion. Thus, we have to iterate over all virtual machines to obtain references to the remote objects. These references are used to collect the measured times.

The monitor also serves as front end for the distribution task, which can either be scheduled for repeated fixed-delay execution or invoked manually via a library call. Basically, our distribution task fetches the measured times and runs the distribution algorithm discussed in section 4.1.

The distribution algorithm sorts the application threads according to their parallelization wins. Each activity is assigned to a group of activities that are optimally placed on the same virtual machine. Finally, each object is assigned to its optimal JVM and possibly migrated there. The migration succeeds only for objects that are not declared to be resident. If nothing was changed during the migration, the distribution task is canceled.

6. Evaluation

In order to evaluate the effectiveness and efficiency of our work, we examined two applications that have potential for locality optimizations. If a program were already distributed optimally at compile time and its locality did not change at run time, there would be nothing to optimize.

The first application is a numerical algorithm with a static structure. We started with a sub-optimal distribution and optimized its locality at runtime. The second application is an n-body simulation with an inherently dynamic structure. We started with an optimal distribution and adapted the locality as the structure of the application changed.

All measurements in this section have been conducted on our Carla cluster, using the Java Server VM 1.4.2_13-b06. The cluster consists of 16 nodes, each equipped with two Pentium III 800 MHz processors and 1 GB of RAM.

6.2. Successive over-relaxation

Successive over-relaxation is a numerical algorithm for solving Laplace equations on a grid. The sequential implementation involves an outer loop for the iterations and two inner loops, each looping over the grid. During an iteration, the new value of each grid point is determined by calculating the average value of its four neighbor points. The algorithm terminates if no point of the grid has changed by more than a certain threshold.

The parallel implementation [9] provided by Maassen is based on a red-black ordering mechanism. The grid is partitioned among the available processors, each processor receiving a number of adjacent rows. Before a processor starts to update the points of a certain color, it exchanges the border rows of the opposite color with its neighbors.

[Figure 1: line chart of time per iteration in ms (y-axis, 0-120) vs. number of machines (x-axis: 2, 4, 8, 16) for the manual, optimized, and random versions; 1000x1000 grid, 300 iterations.]
Figure 1. Results of the SOR benchmark

The SOR benchmark performs 300 iterations of successive over-relaxation on a 1000x1000 grid of double values. The performance was measured on 2, 4, 8, and 16 nodes and is reported in milliseconds per iteration.
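The sequential update scheme of Section 6.2 (each interior point replaced by the average of its four neighbors, swept in red-black order) can be sketched as follows. Grid size, boundary values, and iteration count are invented, and the over-relaxation factor is omitted for brevity:

```java
// Tiny sequential sketch of red-black sweeps on a grid: red points are
// updated first, then black points, each point averaging its four neighbors.
public class RedBlackSweep {
    public static void main(String[] args) {
        int n = 6;
        double[][] grid = new double[n][n];
        for (int i = 0; i < n; i++) grid[i][0] = 1.0; // invented boundary condition

        for (int iter = 0; iter < 100; iter++) {
            for (int color = 0; color < 2; color++) {  // 0 = red, 1 = black
                for (int i = 1; i < n - 1; i++) {
                    for (int j = 1; j < n - 1; j++) {
                        if ((i + j) % 2 != color) continue;
                        grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                           + grid[i][j - 1] + grid[i][j + 1]);
                    }
                }
            }
        }
        // interior values settle strictly between the boundary values 0 and 1
        System.out.println(grid[2][2] > 0.0 && grid[2][2] < 1.0);
    }
}
```

In the parallel version, each processor owns a band of adjacent rows and exchanges border rows of the opposite color with its neighbors before each sweep.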
In order to evaluate our approach, we created three versions of the benchmark: (i) a manual version that creates all remote objects at their optimal location, (ii) a random version where the location of the remote objects is determined randomly, and (iii) an optimized version that invokes the locality optimizations after the first iteration, starting from the random object distribution.

The results of the SOR benchmark are shown in Figure 1. As expected, the manual version performs best, with a constant speedup as the number of machines increases. The random version performs worst and does not scale with additional machines. Finally, the optimized version of the benchmark performs considerably better than the random version, improving its performance towards the optimal version. If more iterations were performed, the optimized version would do even better, since the cost of the locality optimizations would carry less weight.

Figure 1 might give the impression that the optimized version does not scale with additional machines. This is not exactly true, since the cost of the locality optimizations is proportional to the number of nodes, too. Table 2 details the cost of the procedure for the SOR benchmark. Polling the remote objects clearly dominates the overall cost. Despite its quadratic complexity, the cost of the distribution algorithm is relatively small. Again, if the number of iterations were increased or a benchmark with longer processing times were used, the cost would decrease.

Table 2. Cost of the locality optimizations (all times in ms)
# machines   polling remote objects   computing locality algorithm   migrating objects   overall cost
 2                   929                           43                       235               1206.87
 4                  1799                          137                       249               2185.82
 8                  4044                          332                       588               4963.73
16                  7000                          652                      1068               8720.30

6.3. N-body simulation

The n-body simulation approximates the movement of n particles in a two-dimensional space based on mutual gravitation. The simulation is discretized into time steps, and the gravity between the n particles must be computed in each time step. Afterwards, acceleration and the change in velocity and location are determined for each particle. In order to avoid the quadratic complexity of computing the forces, the present implementation uses an approximation proposed by Barnes and Hut. Through hierarchical grouping and the generation of substitute masses for distant space regions, the computation complexity is reduced to O(n log n) operations per time step. We refer to [7] for a detailed description of the benchmark.

The benchmark performs 10 iterations of the n-body simulation with 1000 particles. The performance was measured on 2, 4, 8, and 16 nodes and is reported in seconds per iteration. Again, we created three versions of the benchmark: (i) a manual version with explicit placement annotations, (ii) a random version where the location of the remote objects is determined randomly, and (iii) an optimized version that invokes the locality optimizations after the first iteration, starting from the random object distribution.

Figure 2 shows the results of the n-body benchmark. Because of the dynamic structure of the benchmark, an optimal distribution of the remote objects is hard to predict and depends on the spatial distribution of the particles. As the initial coordinates of the particles are determined randomly and thus are not known a priori, the manual version of the benchmark performs only slightly better than the random version. Since the locality of the application is adapted to the actual location of the particles, the optimized version of the benchmark performs best. The cost of the locality optimizations is easily covered by the savings achieved during the following iterations.

[Figure 2: line chart of time per iteration in s (y-axis, 0-300) vs. number of machines (x-axis: 2, 4, 8, 16) for the manual, optimized, and random versions; 1000 particles, 10 iterations.]
Figure 2. Results of the n-body benchmark

The n-body benchmark is a good example of the effectiveness of our approach. In dynamic settings such as the n-body simulation, it is hard and sometimes impossible to determine a good initial distribution of the remote objects. Even if an optimal distribution can be determined, the performance of the initial distribution will degrade as the locality of the application changes. Only a dynamic approach that optimizes locality at runtime can guarantee consistently high performance throughout the whole life cycle of the application.
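The dynamic approach argued for above is carried by the distribution task of Section 5.6, which can run on a fixed-delay schedule or be invoked on demand. A minimal sketch of that driving loop (our own framing, not the JavaParty API; the three steps are left as comments):

```java
// Sketch of driving the distribution task: periodic fixed-delay scheduling
// plus on-demand invocation via an ordinary method call.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DistributionTask implements Runnable {
    final AtomicInteger runs = new AtomicInteger();

    @Override public void run() {
        // 1. poll all virtual machines for the measured work and cost values
        // 2. run the distribution algorithm of section 4.1
        // 3. migrate non-resident objects to their optimal JVM
        runs.incrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        DistributionTask task = new DistributionTask();

        task.run(); // on-demand invocation ("via a library call")

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(task, 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(120);       // let the scheduled task fire at least once
        scheduler.shutdown();
        System.out.println(task.runs.get() >= 2);
    }
}
```

scheduleWithFixedDelay matches the paper's "repeated fixed-delay execution": the delay is counted from the end of one run to the start of the next, so a slow distribution pass never causes overlapping runs.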
7. Conclusion and future work

In this work, we presented runtime locality optimizations for distributed Java applications. Starting from a static approach, we developed a dynamic methodology to automatically generate a distribution strategy for the objects of a distributed system. We instrumented stubs and skeletons to measure the execution time and communication cost of remote calls. The measured time values are stored locally to avoid communication overhead. The locality optimizations are implemented as a task that runs periodically or can be started on demand. This task collects the measured time values and computes an optimal distribution strategy. In order to realize the distribution strategy, objects are migrated between machines.

We evaluated the effectiveness and efficiency of our work by optimizing two benchmark applications. The first benchmark is a typical example of a numerical algorithm with a static structure, so we created a random initial distribution of the objects and optimized their locality at runtime. The second benchmark has a dynamic structure, so that the performance of the initial object distribution, even an optimal one, will deteriorate at runtime. We have shown that our approach is particularly suitable for such dynamic settings.

In future work, we will focus on automatically adapting the period of the distribution task such that it reflects the processing time of the application. If the structure of the application does not change, we might even want to switch off the measuring completely. For large clusters with thousands of processors, or for applications with a great number of objects, an algorithm with quadratic complexity might be suboptimal. We could imagine a distributed algorithm that works with exact time values for only a couple of local nodes and extrapolates the values for remote nodes.

References

[1] G. Antoniu, L. Bouge, P. Hatcher, M. MacBeth, K. McGuigan, and R. Namyst, "The Hyperion system: Compiling multithreaded Java bytecode for distributed execution", Parallel Computing, 2001.
[2] Y. Aridor, M. Factor, and A. Teperman, "cJVM: a single system image of a JVM on a cluster", Parallel Processing, 1999, pp. 4-11.
[3] M. Factor, A. Schuster, and K. Shagin, "A distributed runtime for Java: yesterday and today", Parallel and Distributed Processing Symposium, 2004.
[4] T. Fahringer, "JavaSymphony: a system for development of locality-oriented distributed and parallel Java applications", Cluster Computing, 2000.
[5] V. Felea, R. Olejnik, and B. Toursel, "ADAJ: a Java Distributed Environment for Easy Programming Design and Efficient Execution", Schedae Informaticae, UJ Press, Krakow, 2004, pp. 9-36.
[6] B. Haumacher, "Lokalitätsoptimierung durch statische Typanalyse in JavaParty" (Locality optimization through static type analysis in JavaParty), Diploma thesis, Institute for Program Structures and Data Organization, University of Karlsruhe, January 1998.
[7] B. Haumacher, "Plattformunabhängige Umgebung für verteilt paralleles Rechnen mit Rechnerbündeln" (Platform-independent environment for distributed parallel computing on clusters), PhD thesis, Institute for Program Structures and Data Organization, University of Karlsruhe, October 2005.
[8] Intel Corp., "Using the RDTSC Instruction for Performance Monitoring", 1997. http://developer.intel.com/drg/pentiumII/appnotes/RDTSCPM1.HTM
[9] J. Maassen and R. V. van Nieuwpoort, "Fast parallel Java", Master's thesis, Dept. of Computer Science, Vrije Universiteit, Amsterdam, August 1998.
[10] M. Philippsen and M. Zenger, "JavaParty - Transparent Remote Objects in Java", Concurrency: Practice and Experience, November 1997.
[11] M. Philippsen and B. Haumacher, "Locality optimization in JavaParty by means of static type analysis", Proc. Workshop on Java for High Performance Network Computing at EuroPar '98, Southampton, September 1998.
[12] M. Philippsen, B. Haumacher, and C. Nester, "More Efficient Serialization and RMI for Java", Concurrency: Practice and Experience, John Wiley & Sons, Chichester, West Sussex, May 2000, pp. 495-518.
[13] Sun Microsystems, "Java Native Interface", 2003. http://java.sun.com/j2se/1.4.2/docs/guide/jni/
[14] Sun Microsystems, "Java Remote Method Invocation Specification", 2003. http://java.sun.com/j2se/1.4.2/docs/guide/rmi/spec/rmiTOC.html
[15] R. Veldema, R. A. F. Bhoedjang, and H. E. Bal, "Jackal, a compiler-based implementation of Java for clusters of workstations", Proceedings of PPoPP, 2001.
[16] W. Zhu, C.-L. Wang, and F. C. M. Lau, "JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support", IEEE Fourth International Conference on Cluster Computing, Chicago, USA, September 2002.