1 year of optimizations on the first scalar supercomputer of Météo-France

P. Marguinaud & R. El Khatib (Météo-France)

24th Aladin Workshop – Hirlam All Staff Meeting 2014
Bucharest, 07-10/04/2014
Plan

   Main changes in the software configurations
   Specific optimizations for ARPEGE
   Parallelization of FESTAT
   FULLPOS-2
      – Optimizations
      – The Boyd biperiodicization
   Other enhancements
      – Post-processing server
      – FA-LFI
      – IO server

Main changes in the software configurations

   Feature                                      On NEC at M-F          On Bull at M-F
   NPROMA                                       Large value (~3582)    Small value (~50)
   MPI gridpoint distribution                   North-South only       North-South and East-West
   MPI spectral distribution                    On waves only          On waves and on vertical levels
   LEQ_REGIONS (ARPEGE)                         .FALSE.                .TRUE. (except for AROME couplers)
   MPI « I/O » distribution (NSTRIN, NSTROUT)   Small values           Maximum values, unless I/O server
   LOPT_SCALAR                                  .FALSE.                .TRUE. (to be revisited)
   OpenMP                                       Not used               Used (but not yet in 3DVar)
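
These settings are controlled through a few namelist switches. A minimal sketch of a Bull-like configuration is given below; only the switch names and the NPROMA order of magnitude come from the table above, while the grouping into NAMDIM/NAMPAR0/NAMPAR1 and the NSTRIN/NSTROUT values are assumptions for the example.

    &NAMDIM
      NPROMA=50,            ! small blocks on the scalar machine (was ~3582 on NEC)
    /
    &NAMPAR0
      LOPT_SCALAR=.TRUE.,   ! scalar-optimized code paths (to be revisited)
    /
    &NAMPAR1
      LEQ_REGIONS=.TRUE.,   ! equal-regions gridpoint partitioning (not for AROME couplers)
      NSTRIN=64,            ! "maximum" I/O distribution; 64 is a placeholder value
      NSTROUT=64,
    /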

Specific optimizations for ARPEGE

     Computation of the Gaussian reduced grids:
        – Merged with the ECMWF program
        – Further optimized for OpenMP
     Computation of the filtering matrices for the post-processing:
        – Calculation of the dilatation/contraction matrices:
            • fully recoded to use the Swarztrauber algorithm for Legendre
              polynomial computation (see the sketch below)
            • Distributed/parallelized
            • Modular
        – Calculation of the filtering matrices themselves:
            • Distributed/parallelized
            • Interfaced with the modular calculation of the dilatation/contraction matrices
    => On-line recomputation of the filtering matrices is fast:
       there is no longer any need to store huge matrices in files.
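
The « distributed/parallelized » part exploits the fact that each zonal wave number can be processed independently of the others. The sketch below only illustrates that OpenMP pattern with the textbook three-term recurrence (unnormalized, hence limited to modest truncations); it is not the Swarztrauber algorithm used in the actual code, and the routine name and interface are invented.

    subroutine legendre_columns(nsmax, px, pleg)
      implicit none
      integer, intent(in)  :: nsmax                 ! spectral truncation
      real(8), intent(in)  :: px                    ! sin(latitude)
      real(8), intent(out) :: pleg(0:nsmax,0:nsmax) ! pleg(n,m) = P_n^m(px), for n >= m
      integer :: jm, jn, jk
      real(8) :: zc, zpmm
      zc   = sqrt(1.d0 - px*px)   ! sqrt(1 - x**2)
      pleg = 0.d0
      ! Each column (zonal wave number jm) is independent of the others,
      ! hence the parallel loop.
    !$omp parallel do private(jn, jk, zpmm)
      do jm = 0, nsmax
        ! Diagonal term P_m^m, computed locally so no dependency between columns
        zpmm = 1.d0
        do jk = 1, jm
          zpmm = -zpmm*(2*jk - 1)*zc
        end do
        pleg(jm,jm) = zpmm
        if (jm < nsmax) pleg(jm+1,jm) = px*(2*jm + 1)*zpmm
        ! Upward three-term recurrence in degree n
        do jn = jm + 2, nsmax
          pleg(jn,jm) = ((2*jn - 1)*px*pleg(jn-1,jm) - (jn + jm - 1)*pleg(jn-2,jm))/(jn - jm)
        end do
      end do
    !$omp end parallel do
    end subroutine legendre_columns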

Parallelization of FESTAT (1)

                  FESTAT: Forecast Error STATistics:
       a mono-task program developed during the years 1993-1997,
         which reads a lot of files and makes a lot of summations

     Substantial re-write and cleaning, DrHook instrumentation
     Optimizations by inverting loops and re-organizing the calculations
      in order to limit repeated computations
     Memory savings
     Development of an « in-core » option to read the files only once
     MPI distribution over the files => the summations must be re-ordered so that
      the result does not depend on the distribution of the members, for bitwise
      reproducibility (see the sketch below)
     MPI distribution + OpenMP parallelization on the k* wave numbers
     MPI distribution on the m waves is possible but not coded yet
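
One way to obtain bitwise reproducibility, sketched below, is to keep one partial contribution per member, gather them, and always accumulate them in member order. The routine and variable names are invented for the example; the real FESTAT code is organized differently.

    subroutine reproducible_sum(kmembers, kspec, ploc, psum, kcomm)
      use mpi
      implicit none
      integer, intent(in)  :: kmembers, kspec, kcomm
      real(8), intent(in)  :: ploc(kspec,kmembers)  ! this task's contribution per member
                                                    ! (zero for members it did not read)
      real(8), intent(out) :: psum(kspec)           ! reproducible total
      real(8), allocatable :: zall(:,:)
      integer :: jm, ierr
      allocate(zall(kspec,kmembers))
      ! Each member is read by exactly one task, so this reduction only adds
      ! zeros to the true contribution: the gathered values are exact.
      call mpi_allreduce(ploc, zall, kspec*kmembers, mpi_real8, mpi_sum, kcomm, ierr)
      ! Final summation in a fixed order (member 1, 2, ...): the result is
      ! bitwise identical whatever the MPI distribution of the files.
      psum(:) = 0.d0
      do jm = 1, kmembers
        psum(:) = psum(:) + zall(:,jm)
      end do
      deallocate(zall)
    end subroutine reproducible_sum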

Parallelization of FESTAT (2)
    [Figures: real time (s) and speed-up (FESTAT vs ideal) as a function of the
     number of nodes, from 4 to 54]

    108 members, 90 levels, truncation 719 x 767
    72 nodes (54 tasks x 12 threads) => 11 min.
Parallelization of FESTAT (3)
    Getting started with parallel FESTAT:

       OMP_NUM_THREADS=$(number of threads)
       Namelist file NAMFESTAT in fort.4
       mpirun -np $(number of mpi tasks) FESTAT

    Example namelist (in fort.4):

       &NAMFESTAT
         NFLEV=60,
         NCASES=72,
         LSTABAL=.true.,
         LANAFBAL=.true.,
         OUTBAL='stabbal',
         OUTCVT='stabcvt',
         OUTCVU='stabcv',
         ELON1=351.613266296832,
         ELAT1=37.33305050,
         ELON2=15.74536410,
         ELAT2=53.02364987,
         ELON0=2.00000000,
         ELAT0=45.80000000,
         NMSMAX=374,
         NSMAX=359,
         NDGL=720,
         NDLON=750,
         NDGUX=709,
         NDLUX=739,
         EDELY=2500.00000000,
         EDELX=2500.00000000,
         LELAM=.TRUE.,
       /

    [Figure: billing units for the « 100 % in memory » option vs « partly on disk »,
     as a function of the number of member files per node (1 to 10)]

    => read festat_guidelines.pdf for more

Optimization of Fullpos-2

    [Figures: (left) Efficiency – real time (s) vs number of nodes (3 to 6),
     cycle 40 vs cycle 40t1; (right) Cost of FULLPOS-927 vs FULLPOS-2 for
     coupling ARPEGE T1198 to AROME 1.3 km – real time (s) for
     FP-927 NODES=9, FP-2 NODES=9 and FP-2 NODES=3]
The Boyd biperiodicization in Fullpos-2 (1)

     New developments have been added:

       – The interpolation grid and the target grid can be different
       – => The biperiodicization has to be moved immediately after
            the horizontal interpolations
       – => Fewer MPI communications
       – => The biperiodicization code is easier to maintain
       – => Switching from splines to the Boyd biperiodicization becomes easy:
            just set NFPBOYD=1 in namelist NAMFPC (see below)
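
For instance (minimal sketch; a real NAMFPC contains many other post-processing keys, omitted here):

    &NAMFPC
      NFPBOYD=1,   ! use the Boyd biperiodicization instead of splines
    /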

The Boyd biperiodicization in Fullpos-2 (2)

    [Figure: difference between splines and Boyd biperiodicization on the U-wind
     at the lowest level]

The Boyd biperiodicization in Fullpos-2 (3)

    [Figure: difference of the biperiodicization before/after the vertical
     interpolations, on the U-wind at the lowest level]

Fullpos-927 will no longer be supported in cycle 41
    [Figure: difference between NFPOS=927 and NFPOS=2 on surface pressure]

Post-processing server

      Start the model once, post-process multiple files (same geometry)
       → saves the setup time
      Simple design (loop in cnt2)
      Simple to use (test case available – ask P. Marguinaud)
      Savings ~ 20 %

                                     FA - LFI
       FA and LFI have a thread-safe interface
         → it is possible to read/write multiple files using OpenMP (see the sketch below)
          – OK with GRIB Edition 0
          – Not yet with GRIB Edition 1 (aka Gribex)
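
A minimal sketch of the resulting pattern is given below; process_one_file is a hypothetical placeholder for the actual FA/LFI open/read/close calls, and the file names are invented for the example.

    program parallel_fa_io
      implicit none
      integer, parameter :: nfiles = 8
      character(len=32)  :: clnames(nfiles)
      integer :: jf
      do jf = 1, nfiles
        write(clnames(jf), '(a,i3.3)') 'file_', jf
      end do
      ! The thread-safe FA/LFI interface lets each thread handle its own file
    !$omp parallel do
      do jf = 1, nfiles
        call process_one_file(clnames(jf), 70 + jf)   ! one logical unit per file
      end do
    !$omp end parallel do
    contains
      subroutine process_one_file(cdname, kunit)
        character(len=*), intent(in) :: cdname
        integer, intent(in) :: kunit
        ! Placeholder: the actual FA/LFI open, read/write and close calls would
        ! go here, using logical unit kunit for this file.
        print *, 'processing ', trim(cdname), ' on unit ', kunit
      end subroutine process_one_file
    end program parallel_fa_io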

IO server

      Field transposition is now performed by the IO server tasks (no more
       GATH_GRID/GATH_SPEC in the model)

      Synchronization tools (retrieve model data as soon as it is produced); a test
       case is available – ask P. Marguinaud

      The IO server creates several files for each forecast term
        → the LFI library now handles this transparently: there is no need to
          concatenate the files (an index is created over the multiple LFI files)

      Coupling files can be read through the IO server; this saves 2.5 % of time on
       AROME 1.3 km

Summary / Conclusion

      Ready for higher resolutions :
       – Arome 1.3 km L90
       – Arpege T1198L105
      Among next plans :
       – Toward higher scalability
          • Fullpos-2 in a single 4DVar binary for OOPS
          • ...
        – Continuing the porting work
          • Parallelization of FESTAT-Arpege
          • Conf. 903 including the « PP-server » facility to replace
            the monotask conf. 901 (« IFS-to-FA » file transformation)
          • ...

Thank you for your attention!
