The MUMPS Solver: academic needs
and industrial expectations
Chiara Puglisi (Inria-Grenoble (LIP-ENS Lyon))

MUMPS group, CERFACS, CNRS, ENS-Lyon, INRIA, INPT, Université Bordeaux 1
Séminaire Aristote - HPC-Desk — ONERA, France, May 20th, 2014
Outline

   Academic needs: a research platform for sparse direct solvers

   Industrial expectations: the MUMPS solver as a software platform

   Concluding remarks: research and software perspectives

Academic needs: a research platform

      Code Aster, Carter (e.g., finite elements) → Solution of sparse systems Ax = b
      Often the most expensive part in numerical simulation codes
  Sparse direct methods to solve Ax = b:
   • Decompose A in the form LU, LDL^T or LL^T
   • Solve the triangular systems Ly = b, then Ux = y
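
The two steps above can be sketched on a small dense matrix. This is only an illustration, assuming a matrix that needs no pivoting and dense storage; `lu_factor` and `lu_solve` are hypothetical helper names, not MUMPS routines, and a sparse direct solver works on sparse data structures with numerical pivoting.

```python
def lu_factor(a):
    """Doolittle LU without pivoting: returns (L, U) with a unit-diagonal L."""
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [row[:] for row in a]  # work on a copy of A
    for k in range(n):
        if upper[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot: this sketch does no pivoting")
        for i in range(k + 1, n):
            lower[i][k] = upper[i][k] / upper[k][k]
            for j in range(k, n):
                upper[i][j] -= lower[i][k] * upper[k][j]
    return lower, upper


def lu_solve(lower, upper, b):
    """Solve L y = b by forward substitution, then U x = y by back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(lower[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(upper[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / upper[i][i]
    return x


# Small diagonally dominant example (hypothetical data); the exact solution is [1, 1, 1].
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]

L, U = lu_factor(A)
x = lu_solve(L, U, b)
print(x)
```

The factorization is computed once and can be reused for several right-hand sides, which is one reason direct methods are attractive in simulation codes.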
  3D example in earth science: acoustic wave propagation,
  27-point finite difference grid

  Current goal [Seiscope project]: LU on complete earth, n = N^3 = 1000^3

  Extrapolation on a 1000 × 1000 × 1000 grid: 55 exaflops, 200 TBytes for
  factors, 40 TBytes for active memory!
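
To get a feeling for these orders of magnitude, the figures can be plugged into a few lines of arithmetic. This is a rough sketch under stated assumptions: "55 exaflops" is read as 55 × 10^18 floating-point operations, one TByte is taken as 10^12 bytes, and the 1.4 PFlops peak rate is the machine quoted later in the talk; the resulting time is a lower bound at peak speed, not a prediction.

```python
N = 1000
n = N ** 3                      # number of unknowns on the N x N x N grid
nnz_bound = 27 * n              # a 27-point stencil gives at most 27 nonzeros per row

factor_flops = 55e18            # assumption: "55 exaflops" = 55e18 operations
factor_bytes = 200e12           # assumption: 1 TByte = 1e12 bytes
active_bytes = 40e12
total_bytes = factor_bytes + active_bytes

peak_rate = 1.4e15              # flop/s, the 1.4 PFlops computer quoted later in the talk
min_seconds = factor_flops / peak_rate

print(f"unknowns: {n:.1e}, nonzeros in A: at most {nnz_bound:.1e}")
print(f"total memory: {total_bytes / 1e12:.0f} TBytes")
print(f"factorization at peak speed: at least {min_seconds / 3600:.1f} hours")
```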
Sparse direct solution: main research issues
   [Figure: Code Aster, EDF Pump, nuclear backup circuit]
   [Figure: frequency-domain seismic modeling, Helmholtz equations, SEISCOPE project; velocity model in m/s (3000 to 6000), axes Dip (km), Cross (km), Depth (km)]

   Main algorithmic issues
       • Parallel algorithmic issues: synchronization avoidance, mapping
         irregular data structures, scheduling.
       • Performance scalability: time but also memory/proc when increasing
         number of processors (and problem size).
       • Numerical issues: numerical accuracy, hybrid iterative-direct solvers,
         application-specific solvers (e.g., for elliptic PDEs)

Robust memory-aware mappings

  Context
  [Figure: several nodes, each with factors, active memory and disk]

  • Memory per node or core is decreasing
  • Active memory is not naturally scalable and is difficult to estimate

  Algorithmic work

  • Design mapping algorithms that enforce some memory constraints
    and provide better memory estimates.
  • Active memory size dominates total memory in parallel.
    Example: share of active storage on the AUDI matrix:
      1 processor: 11%
      256 processors: 59%
Robust memory-aware mappings (problem)

   Metric: active memory efficiency

       e(p) = Sseq / (p × Smax(p))

       with Sseq the sequential memory and Smax(p) the maximum memory used on p processors

   We would like e(p) ≈ 1, i.e. Sseq/p on each processor.

       Common mappings/schedulings → poor memory efficiency:

       • Standard proportional mapping: e(p) → 0 as p → ∞ on regular problems.
       • With more sophisticated relaxed proportional mapping, typical
         efficiency e(p) is still between 0.10 and 0.40 (and memory estimates
         are unreliable).
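
The metric is easy to evaluate once Sseq and Smax(p) are measured. The sketch below uses made-up memory figures (they are not MUMPS measurements) and a hypothetical helper name `memory_efficiency`; it only illustrates how e(p) moves away from 1 when the most loaded processor holds more than Sseq/p.

```python
def memory_efficiency(s_seq, s_max, p):
    """Active memory efficiency e(p) = Sseq / (p * Smax(p))."""
    if p < 1 or s_max <= 0:
        raise ValueError("need p >= 1 and Smax(p) > 0")
    return s_seq / (p * s_max)


# Hypothetical figures, in MB: 6400 MB of active memory in a sequential run, 64 processes.
s_seq = 6400.0
p = 64

ideal = memory_efficiency(s_seq, 100.0, p)   # every process holds exactly Sseq / p
poor = memory_efficiency(s_seq, 400.0, p)    # the most loaded process holds 4x that

print(ideal)  # 1.0
print(poor)   # 0.25
```

An efficiency of 0.25 sits inside the 0.10 to 0.40 range quoted above for relaxed proportional mapping.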

Robust Memory-Aware mappings (results)

  • Reduce memory ↔ serialize some branches in the elimination tree

  ⇒ Reliable memory estimation and better memory use with memory-aware
  mappings than with the default version (MUMPS 4.10.0).

  Illustration with matrix PANCAKE 2 (3D electromagnetism, Cedrat
  (Flux) and Padova Univ.), 64 MPI processes
                                       MUMPS 4.10.0   Memory-aware mappings
    Objective max MB/core                   n/a           400       200
    Time (seconds)                          418           591       684
    Active workspace (avg MB/core)        539.4         234.7     180.0
    Active workspace (max MB/core)        900.3         356.2     181.5
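
The trade-off in this table can be summarized with a few ratios. The sketch below only re-uses the numbers from the table; the derived quantities (slowdown, peak-memory ratio, avg/max balance) and the names `results` and `summary` are chosen here for illustration and are not metrics reported by the authors.

```python
# Figures copied from the table (PANCAKE 2 matrix, 64 MPI processes).
results = {
    "MUMPS 4.10.0": {"objective": None, "time_s": 418.0, "avg_mb": 539.4, "max_mb": 900.3},
    "memory-aware 400": {"objective": 400.0, "time_s": 591.0, "avg_mb": 234.7, "max_mb": 356.2},
    "memory-aware 200": {"objective": 200.0, "time_s": 684.0, "avg_mb": 180.0, "max_mb": 181.5},
}

base = results["MUMPS 4.10.0"]
summary = {}
for name, r in results.items():
    summary[name] = {
        "slowdown": r["time_s"] / base["time_s"],      # time relative to the default version
        "peak_ratio": r["max_mb"] / base["max_mb"],    # peak workspace relative to the default
        "balance": r["avg_mb"] / r["max_mb"],          # 1.0 means perfectly balanced workspace
        "objective_met": r["objective"] is None or r["max_mb"] <= r["objective"],
    }

for name, s in summary.items():
    print(f"{name}: slowdown x{s['slowdown']:.2f}, peak {100 * s['peak_ratio']:.0f}% of default, "
          f"avg/max {s['balance']:.2f}, objective met: {s['objective_met']}")
```

Read this way, the table shows a factorization that is roughly 1.4 to 1.6 times slower in exchange for a peak workspace reduced to about 40% and 20% of the default, with both memory objectives respected.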
Application-specific solvers: BLR solver
                Block Low-Rank approximations to improve
                       sparse multifrontal solvers

  Low-rank approximations (elliptic PDEs)

          • memory compression and flop reduction
          • accuracy controlled by a numerical parameter
            (→ can also be used as a preconditioner)

   Main features of Block Low Rank (BLR) format

   • Algebraic solver; flat and simple format
   • Compatibility with numerical pivoting

  ⇒ Many representations: recursive H, H² [Bebendorf, Börm, Hackbusch,
  Grasedyck, ...], HSS/SSS [Chandrasekaran, Dewilde, Gu, Li, Xia, ...], flat block
  low-rank (BLR), ...
Block Low Rank multifrontal solver

  [Figure: elimination tree and a block B]

  • Singular value decomposition (SVD) of each block B:
    B = X1 S1 Y1 + X2 S2 Y2
  • Truncation at rank k(ε): B ≈ X1 S1 Y1, with error E = X2 S2 Y2 and
    ||E||_2 = ||X2 S2 Y2||_2 = σ_{k+1} ≤ ε
  → Block Low-Rank Solver (BLR), PhD INP-EDF, 2013, C. Weisbecker
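
The truncation rule on this slide can be tried on a single block with NumPy. This is a minimal sketch, not the BLR implementation in MUMPS: the block is a hypothetical smooth-kernel interaction between two well-separated point sets, `truncate_block` is an illustrative helper name, and the threshold is applied to absolute singular values.

```python
import numpy as np


def truncate_block(block, eps):
    """Keep the singular triplets with singular value > eps: B ~= X1 S1 Y1."""
    x, s, yt = np.linalg.svd(block, full_matrices=False)
    k = int(np.sum(s > eps))          # rank k(eps), so that sigma_{k+1} <= eps
    return x[:, :k], s[:k], yt[:k, :]


# Hypothetical block: kernel 1/|x - y| between two separated 1D point sets.
m = 64
src = np.linspace(0.0, 1.0, m)
dst = np.linspace(3.0, 4.0, m)
block = 1.0 / np.abs(src[:, None] - dst[None, :])

eps = 1e-8
x1, s1, y1 = truncate_block(block, eps)
k = s1.size

approx = (x1 * s1) @ y1                      # X1 S1 Y1
err = np.linalg.norm(block - approx, 2)      # spectral norm of E = X2 S2 Y2

full_entries = block.size                    # m * m values for the full-rank block
blr_entries = k * (block.shape[0] + block.shape[1])  # values in X1 (scaled by S1) and Y1

print(f"rank k = {k} out of {m}")
print(f"||E||_2 = {err:.2e} (threshold {eps:.0e})")
print(f"storage: {blr_entries} values instead of {full_entries}")
```

The storage count k(m + n) versus mn is where the memory compression comes from; the same factored form also reduces the flops of products involving the block.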
Application to frequency-domain seismic modeling
  [Figure: four 3D panels; axes Dip (km) 0 to 20, Cross (km) 0 to 20, Depth (km) 0 to 4]

  BLR results as a percentage of the standard (full-rank) sparse solver:

    ε        Frequency   Operations   Memory |L|   Memory |CB|
    10^-5    2 Hz        41.8 %       61.8 %       32.3 %
    10^-5    4 Hz        27.4 %       50.0 %       24.4 %
    10^-5    8 Hz        21.8 %       41.6 %       23.9 %
    10^-4    2 Hz        32.9 %       53.4 %       23.9 %
    10^-4    4 Hz        20.0 %       42.2 %       21.7 %
    10^-4    8 Hz        15.2 %       28.9 %       19.4 %
Industrial expectations: a software platform
  Technological transfer
   • From research prototyping during PhD thesis to robust and
     portable software. Examples:
     ◦ Memory Aware : PhDs E. Agullo (LIP-ENS, 2008) and F.-H. Rouet
       (INPT-IRIT, 2012);
     ◦ Block Low Rank: PhD C. Weisbecker (INPT-IRIT with EDF support,
       2013).

  Software issues and interaction with users
   • Code development: develop and combine complex features
   • Software engineering: analysis/experimentation/validation tools,
     maintenance (also essential for research developments!)
   • Users: expect support, training and adaptation/developments but
     also: research collaborations, software validation and financial
     support.
MUMPS solver software platform

  General context

  • Initially funded by a European project (1996-1999),
    12 partners from 5 countries
  • Publicly available since 1999 at http://graal.ens-lyon.fr/MUMPS
    and http://mumps.enseeiht.fr
  • Co-developed in Toulouse, Lyon-Grenoble, Bordeaux by CERFACS,
    CNRS, ENS Lyon, INPT, Inria, Univ. Bordeaux
  • Latest release MUMPS 4.10.0, May 2011, ≈ 250 000 lines of C
    and Fortran code

  Competitive and original software package used worldwide
  • Integrated within commercial and open-source packages (e.g.,
    Samcef from Samtech, Actran from Free Field Technologies, Code Aster from
    EDF, PAM-Crash from ESI, IPOPT, PETSc, Trilinos, Debian packages, ...).
Software requests

   World Map since Dec. 2002 (8839 requests)


   The number of requests per day has increased steadily throughout the
   evolution of the software

   Requests per day, by MUMPS release:

     Release            4.3    4.5    4.6    4.7    4.8    4.9    4.10
     Requests per day   1.3    1.31   1.58   2.04   2.84   3.52   4.02

   The latest version (4.10.0) is downloaded more than 1000 times per
   year
MUMPS Team (May 2014)

   Permanent members:
    Patrick Amestoy (INPT-IRIT, Toulouse)
    Jean-Yves L’Excellent (INRIA-LIP, Lyon)
    Abdou Guermouche (LABRI, Bordeaux)
    Bora Uçar (CNRS-LIP, Lyon)
    Alfredo Buttari (CNRS-IRIT, Toulouse)

   Engineers:
    Guillaume Joslin (Université Paul Sabatier, Toulouse)
    Chiara Puglisi (INRIA, Grenoble)
    Part time on MUMPS: Maurice Brémond (INRIA, Grenoble)

   PhD Students:
    Mohamed Sid-Lakhdar (ENS-Lyon)
    Florent Lopez (UPS, Toulouse)

2000-2013: Research through PhD’s

   Ph.D. students connected to the project (timeline 2000-2014, earliest first):
    • C. Voemel, CERFACS
    • A. Guermouche, ENS Lyon
    • S. Pralet, CERFACS
    • E. Agullo, ENS Lyon
    • M. Slavova, CERFACS
    • F.-H. Rouet, INPT
    • C. Weisbecker, INPT-EDF
    • W. Sid-Lakhdar, ENS Lyon
    • F. Lopez, UPS

   Some research themes: Preprocessing and orderings, Numerical
   pivoting and accuracy, Numerical features, Memory usage and task
   scheduling, Shared-memory parallelism

Relations with our users
  Exchanges with users

   • Direct contacts by email
   • MUMPS Users Mailing list

  MUMPS Users Days
   1. October 24th, 2006, Lyon, France
   2. April 15th - 16th, 2010, Toulouse, France
   3. May 29th - 30th, 2013, EDF, Clamart, France
  Objectives of these workshops:
   • Present some facets of the algorithmic, numerical and software
     work in the context of the MUMPS project/solver
   • Share experience
   • Identify users' expectations (software evolution, new features)
   • Discuss future research tracks and future of MUMPS
Research perspectives

   Scientific hurdles and related research areas
    • Computation driven by memory: Memory-aware algorithms
    • Controlled accuracy to improve complexity: BLR Solver
    • Multicore and asynchronous communications: key issue for
        time and memory scalability; algorithms and communication
        schemes need to be revisited.

   Performance projection and target (3D Helmholtz; n = 10^9; 1.4 PFlops
   computer, 2000 nodes, 32 cores/node)
    (Still much research and software work needed to reach this target!)

                                MUMPS 4.10.0         Research target
                 Time            10^7 seconds          10^4 seconds
                 Factors           8 GB/core             3 GB/core
                 Workspace        50 GB/core             2 GB/core
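
The gap between the current version and the target can be made concrete with a short calculation. This sketch reads the two times as 10^7 and 10^4 seconds and assumes that factors and workspace simply add up on each core; the per-node totals are derived here for illustration and do not appear on the slide.

```python
nodes = 2000
cores_per_node = 32
cores = nodes * cores_per_node

current = {"time_s": 1e7, "factors_gb": 8.0, "workspace_gb": 50.0}   # MUMPS 4.10.0 projection
target = {"time_s": 1e4, "factors_gb": 3.0, "workspace_gb": 2.0}     # research target


def per_node_gb(config):
    """Memory needed on one node, assuming factors and workspace add up per core."""
    return (config["factors_gb"] + config["workspace_gb"]) * cores_per_node


speedup = current["time_s"] / target["time_s"]

print(f"cores: {cores}")
print(f"time: {current['time_s'] / 86400:.0f} days -> {target['time_s'] / 3600:.1f} hours "
      f"(x{speedup:.0f})")
print(f"memory per node: {per_node_gb(current):.0f} GB -> {per_node_gb(target):.0f} GB")
```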
Software agreement

   Agreement signed by the owners of the software: CERFACS, CNRS, ENS Lyon,
   INPT, Inria, Univ. Bordeaux 1.

   Key features

    • All institutions have recognized and confirmed their will to freely
        distribute MUMPS releases
    • A technical committee supervises technical/scientific decisions
    • Conditions of use for development version defined
    • Conditions of transfer toward next public version defined
    • License for public versions: CeCILL-C (LGPL-compatible)

Sustainability of MUMPS software and research platform

  Objectives

   • Stabilize engineering work and expertise with long-term positions
   • Ensure software quality and faster transfer of research work

  MUMPS Consortium
   • Type: group of users
   • Objective: support engineering work
   • Services: beta-release of future/new functionalities, annual
     meeting to share experience, wish list to influence priority in
     development, training cycles . . .

  Ongoing work ... it takes more time than one could have expected