Master-Seminar: Hochleistungsrechner - Aktuelle Trends und Entwicklungen
Winter Term 2017/2018

Fault Tolerance in High Performance Computing

Simon Klotz
Technische Universität München

Abstract

Fault tolerance is one of the essential characteristics of High-Performance Computing (HPC) systems to reach exascale performance. Higher failure rates are anticipated for future systems because of an increasing number of components. Furthermore, new issues such as silent data corruption arise. There has been significant progress in the field of fault tolerance in the past few years, but the resilience challenge is still not solved. The most widely used method, global rollback-recovery, does not scale well enough for exascale systems, which is why other approaches need to be considered. This paper gives an overview of established and emerging fault tolerance methods such as rollback-recovery, forward recovery, algorithm-based fault tolerance (ABFT), replication, and fault prediction. It also describes how LAIK can be utilized to achieve fault-tolerant HPC systems.

1 Introduction

Fault tolerance is the ability of an HPC system to avoid failures despite the presence of faults. Past HPC systems did not provide any fault tolerance mechanisms since faults were considered rare events [1]. However, with the current and next generation of large-scale HPC systems, faults will become the norm. Due to power and cooling constraints, the increase in clock speed is limited, and it is expected that the number of components per system will continue to grow to improve performance [2], following the trend of the current Top500 HPC systems [3]. A higher number of components makes it more likely that one of them fails at a given time. Furthermore, the decreasing size of transistors increases their vulnerability.

The resilience challenge [4] encompasses all issues associated with an increasing number of faults and how to deal with them. Commonly used methods such as coordinated checkpointing do not scale well enough to be suited for exascale environments. Thus, the research community is considering other approaches such as ABFT, fault prediction, or replication to provide a scalable solution for future HPC systems. This paper presents existing and current research in the field of fault tolerance in HPC and discusses the advantages and drawbacks of each method.

The remainder of this paper is structured as follows: Section 2 introduces the terminology of fault tolerance. Fault recovery methods such as rollback-recovery, forward recovery, and silent error detection are covered in Section 3. Section 4 discusses fault containment methods such as ABFT and replication. Section 5 is concerned with fault avoidance methods including failure prediction and migration. Finally, Section 6 presents LAIK, an index space management library that enables all three classes of fault tolerance.

2 Background

First, essential concepts of the field have to be introduced for an efficient discussion of current fault tolerance methods. This paper relies on the definitions of Avizienis et al. [5], who define faults as causes of errors, which in turn are deviations from the correct total state of a system. Errors can lead to failures, which imply that the external and observable state of a system is incorrect. Faults can be either active or dormant, whereas only active faults lead to an error. However, dormant faults can turn into active faults through the computation or external factors. Furthermore, faults can be classified into permanent and transient faults. Transient faults only appear for a certain amount of time and then disappear without external intervention, in contrast to permanent faults, which do not disappear. Errors can be classified as detected, latent, or masked. Latent errors are only detected after a certain amount of time, and masked errors do not lead to failures at all. Errors that lead to a failure are also called fail-stop errors, whereas undetected latent errors that only corrupt the data are called silent errors. Silent errors have been identified as one of the greatest challenges on the way to exascale resilience [1]. As an example, cosmic radiation impacting memory and leading to a discharge would be a fault. Then, the affected bits flip, which would be an error. If an application now uses these corrupted bits, the observable state would deviate from the correct state, which is a failure.

Figure 1: Classification of faults, errors, and failures

An important metric to discuss fault tolerance mechanisms is the Mean Time Between Failures (MTBF). This paper follows the definition of Snir et al. [6], who define it as

    MTBF = Total runtime / Number of failures

Several studies [7, 8] were conducted to classify failures and determine their source in HPC systems. In order to analyze failures, researchers define different failure classes based on their cause. Schroeder et al. [7] distinguish between hardware, software, network, environment, human, and undetermined failures, whereas Oliner et al. [8] only distinguish between software, hardware, and undetermined failures. Depending on the studied system, the occurrence of the different failure classes differs significantly. A possible explanation is the discrepancy in system characteristics and the failure classes used. Nevertheless, most researchers agree that software and hardware faults are the most frequent causes of failures [7]. Hardware faults can occur in any component of an HPC system, such as fans, CPUs, or memory. One of the most important classes of hardware faults are memory bit errors due to cosmic radiation. Software faults can occur at any part of the software stack. The most common faults include unhandled exceptions, null reference pointers, out-of-memory errors, timeouts, and buffer overflows [6]. Since hardware and software faults are the most common in HPC systems, and most fault tolerance methods apply to them, the remainder of this paper will focus on these two classes.
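To make the definition concrete, the following C snippet computes the MTBF from an observed failure count (a minimal sketch; the function name and the example numbers are illustrative, not taken from the paper):

    #include <stdio.h>

    /* MTBF as defined by Snir et al. [6]: total runtime divided by the
     * number of failures observed during that runtime. */
    double mtbf_hours(double total_runtime_hours, unsigned num_failures) {
        if (num_failures == 0)
            return total_runtime_hours; /* no failure observed: lower bound */
        return total_runtime_hours / num_failures;
    }

    int main(void) {
        /* Example: 720 h of operation with 6 failures gives an MTBF of 120 h. */
        printf("MTBF: %.1f h\n", mtbf_hours(720.0, 6));
        return 0;
    }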
3 Fault Recovery

The objective of fault recovery is to recover a normal state after the application experienced a fault, thus preventing a complete application failure. Fault recovery methods can be divided into rollback-recovery, forward recovery, and silent error detection. Rollback-recovery is the most widely investigated approach and is commonly used in many large-scale HPC systems. Rollback-recovery methods periodically take snapshots of the application state, and in case of failure the execution reverts to that state. Forward recovery avoids this restart by letting the application recover by itself using special algorithms. A novel research direction is silent error detection, which can be combined with rollback-recovery and forward recovery to recover from faults.

3.1 Rollback-recovery

Rollback-recovery, which is also called checkpoint-restart, is the most widely used and investigated fault tolerance method [9, 10]. Checkpointing methods can be classified into multiple categories according to how they coordinate checkpoints and at which level of the software stack they operate. Each class takes a different approach to the challenge of saving a consistent state of a distributed system, which is not a trivial task since large-scale HPC systems generally do not employ any locking mechanisms. The two major approaches to checkpointing a consistent state of an HPC system are coordinated checkpointing and uncoordinated checkpointing with message-logging [10], as depicted in Figure 2.

Figure 2: Difference between coordinated protocols and uncoordinated protocols with message-logging [11]

Coordinated protocols store checkpoints of all processes simultaneously, which imposes a high load on the file system. They enforce a synchronized state of the system, such that no messages are in transit during the checkpointing process, by flushing all remaining messages. The advantage of coordinated checkpointing is that each checkpoint captures a consistent global state, which simplifies the recovery of the whole application. Furthermore, coordinated protocols have a comparatively low overhead since they do not store messages between processes [9]. However, if one process fails, all other processes need to be restarted as well [11]. Also, with a growing number of components and parallel tasks, the cost of storing checkpoints in a coordinated way steadily increases.
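As an illustration, the following C/MPI sketch shows the core of application-level coordinated checkpointing; the function name and file layout are assumptions for this example, and the barriers stand in for the message-flushing step of a full protocol:

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal sketch of a coordinated checkpoint: all ranks synchronize so
     * that no application messages are in flight, then every rank writes
     * its local state; the checkpoint set only counts as valid once all
     * ranks have finished writing. */
    void write_checkpoint(const double *state, int n, int rank, int epoch) {
        char path[64];
        snprintf(path, sizeof(path), "ckpt_%05d_rank%04d.bin", epoch, rank);

        /* Enforce the synchronized global state required by coordinated
         * protocols (a real protocol also flushes in-transit messages). */
        MPI_Barrier(MPI_COMM_WORLD);

        FILE *f = fopen(path, "wb");
        if (f) {
            fwrite(state, sizeof(double), (size_t)n, f);
            fclose(f);
        }

        /* The checkpoint is only consistent once every rank has written. */
        MPI_Barrier(MPI_COMM_WORLD);
    }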
Message-logging protocols store messages between nodes with a timestamp and the associated content, thus capturing the state of the network [9]. They are based on the assumption that the application is piecewise deterministic and that the state of a process can be recovered by replaying the recorded messages from its initial state [9]. In practice, each process periodically checkpoints its state so that recovery can continue from there instead of the initial state [12]. In contrast to coordinated protocols, only the processes affected by a failure need to be restarted, which makes this approach more scalable. However, each process needs to maintain multiple checkpoints, which increases memory consumption, and inter-process dependencies can lead to the so-called domino effect [9]. Furthermore, all messages and their content need to be stored as well.

Message-logging protocols can be further divided into optimistic, pessimistic, and causal protocols [13]. Optimistic protocols store checkpoints on unreliable media, e.g. the node running a process itself. Furthermore, a process can continue its execution during logging [11]. This method has less overhead compared to the others but can only tolerate a single failure at a time, and if a node itself fails the whole application has to be restarted. Pessimistic protocols store all checkpoints on reliable media, and processes have to halt execution until messages are logged. This enables them to tolerate multiple faults and complete node failures, at the cost of a higher overhead for using reliable media and halting execution. Causal protocols do not use reliable media either but log the messages on neighboring nodes together with additional causality information. This omits the need for reliable media while an application is still able to recover if one node experiences a complete failure.

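A minimal C/MPI sketch of the pessimistic variant is shown below; the wrapper and the log format are hypothetical, but they illustrate why pessimistic logging is safe yet slow: every receive blocks until the log entry has reached stable storage:

    #include <mpi.h>
    #include <stdio.h>

    /* Receiver-side pessimistic message logging: each received message is
     * forced to stable storage before the application may process it, so
     * a failed process can later be replayed deterministically from its
     * last checkpoint. */
    int recv_logged(void *buf, int count, MPI_Datatype type, int src,
                    int tag, MPI_Comm comm, FILE *log) {
        MPI_Status st;
        int rc = MPI_Recv(buf, count, type, src, tag, comm, &st);
        if (rc != MPI_SUCCESS)
            return rc;

        int size;
        MPI_Type_size(type, &size);

        /* Pessimistic: block until source, tag, and payload are logged. */
        fwrite(&st.MPI_SOURCE, sizeof(int), 1, log);
        fwrite(&st.MPI_TAG, sizeof(int), 1, log);
        fwrite(buf, (size_t)size, (size_t)count, log);
        fflush(log);

        return MPI_SUCCESS;
    }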
Checkpointing methods can operate on different levels of the software stack, which has an important impact on transparency and efficiency. Checkpointing can be implemented by the operating system, a library, or the application itself [14]. Operating system extensions can take global checkpoints that provide transparent checkpointing of the whole application state and do not require modifications of the application code. There are also libraries that act as a layer on top of MPI and enable global and transparent checkpointing. However, global checkpointing has a large overhead, since the storage costs are proportional to the memory footprint of the system [15], which often includes transient data or temporary variables that are not critical for recovery. Approaches such as compiler-based checkpointing intend to solve this issue by automatically detecting transient variables and excluding them from checkpoints, which reduces the total size of checkpoints. All of the previously mentioned approaches store checkpoints automatically, which does not allow any control over the procedure. User-level fault tolerance libraries assume that developers know best when to checkpoint and what data to store, which can significantly decrease the checkpoint size. However, this also increases application complexity [15], and an application developer can be mistaken and forget about critical data.

Researchers have introduced different variations of and additions to the general checkpointing techniques to reduce the resources needed to reach feasible performance for exascale HPC systems. One of these approaches is diskless checkpointing [16], which stores checkpoints in memory. Other approaches include multi-level checkpointing [17], which hierarchically combines multiple storage technologies with different reliability and speed, allowing a significantly higher checkpointing frequency since checkpointing on the first level is generally fast.
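The level-selection policy at the heart of multi-level checkpointing can be sketched in a few lines of C; the level names and intervals are illustrative parameters, not taken from [17]:

    /* Two illustrative storage levels: a fast node-local copy that is lost
     * if the node fails, and the slow but reliable parallel file system. */
    enum ckpt_level { CKPT_NONE, CKPT_LOCAL, CKPT_PARALLEL_FS };

    /* Cheap local checkpoints are taken frequently; only every k-th
     * checkpoint goes to the parallel file system, so the expensive level
     * is used rarely while the checkpoint frequency stays high. */
    enum ckpt_level choose_level(int iteration, int local_every, int global_every) {
        if (iteration % global_every == 0)
            return CKPT_PARALLEL_FS;  /* survives node loss, but slow */
        if (iteration % local_every == 0)
            return CKPT_LOCAL;        /* fast, lost on node failure */
        return CKPT_NONE;
    }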
                                                            can be used, to revert to a consistent application
                                                            state instead. Application developers can also im-
3.2    Forward recovery
                                                            plement application-specific routines to recover a
3.3 Silent Error Detection

The objective of silent error detection methods is to detect latent errors such as multi-bit faults. Silent error detection is a relatively novel research area of increasing importance, since smaller components are more vulnerable to cosmic radiation [1]. It is application specific and uses historic application data to detect anomalous results during the computations of an application, which indicate a silent data corruption (SDC).

Several methods can be used to detect anomalies in the application data. Vishnu et al. [19] use different machine learning methods able to correctly detect multi-bit errors in memory with 97% accuracy. After a silent error is detected, it has to be corrected. Gomez et al. [20] simply replace corrupted values with their predictions. Since it is not always desirable to use approximate predictions, rollback-recovery can be used instead to revert to a consistent application state. Application developers can also implement application-specific routines to recover a normal state.

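A drastically simplified, one-dimensional version of such a detector is sketched below in C; real detectors such as the multivariate interpolation of Gomez et al. [20] use richer models, and the threshold here is an illustrative tuning parameter:

    #include <math.h>

    /* Prediction-based SDC detection for a smoothly evolving variable:
     * the new value is compared against a linear extrapolation of its two
     * predecessors, and a large deviation flags a possible silent error. */
    int sdc_suspected(double prev2, double prev1, double current,
                      double threshold) {
        double predicted = prev1 + (prev1 - prev2); /* linear extrapolation */
        return fabs(current - predicted) > threshold;
    }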
One drawback of silent error detection is that it introduces computational overhead. Another limitation is the false positive rate, which has a significant impact on the overall performance of an application. Furthermore, it takes significant effort to develop silent error detection methods since they are application specific. Nevertheless, silent error detection is one of the few methods able to deal with silent errors, and it has less resource overhead compared to replication [21].

4 Fault Containment

Fault containment methods let an application continue to execute until it terminates, even after multiple faults occur. Compensation mechanisms are in place to contain the fault and avoid an application failure. They also ensure that the application returns correct results after termination. The most commonly used fault containment methods are replication and ABFT. Replication systems run copies of a task on different nodes in parallel, and if one fails, others can take over, providing the correct result. ABFT includes all algorithms which can detect and correct faults that occur during computations.

4.1 Replication

Replication methods replicate all computations on multiple nodes. If one replica is hit by a fault, the others can continue the computation and the application terminates as expected.

Replication methods can be distinguished into methods that deal with either fail-stop or silent errors. MR-MPI [22] deals with fail-stop errors and uses a replica for each node. As soon as one node fails, the replica takes over and continues execution. Only if both nodes fail does the application need to be restarted. In contrast, RedMPI [23] does not address complete node failures but silent errors. It compares the output of all replicas to correct SDCs.
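The comparison step can be sketched in C/MPI as follows; the pairing of each rank with a partner replica is an assumption of this example, not RedMPI's actual implementation:

    #include <mpi.h>
    #include <stdlib.h>

    /* Detecting silent errors through replication: two replicas exchange
     * their results and compare them; any mismatch indicates an SDC in
     * one of the two computations. */
    int replicas_agree(const double *result, int n, int partner, MPI_Comm comm) {
        double *other = malloc((size_t)n * sizeof *other);
        int agree = 1;

        /* Exchange results with the partner replica. */
        MPI_Sendrecv(result, n, MPI_DOUBLE, partner, 0,
                     other, n, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);

        for (int i = 0; i < n; i++) {
            if (result[i] != other[i]) {  /* mismatch: SDC in one replica */
                agree = 0;
                break;
            }
        }

        free(other);
        return agree;
    }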
A novel research direction is to use approximate replication to detect silent errors. A complementary node runs an approximate computation which is then compared to the actual result to detect SDCs. Benson et al. [24] use cheap numerical computations to verify the original results, which reduces the overhead compared to other replication methods.

Replication generally has a high overhead, because multiples of all resources are needed depending on the replication factor. Furthermore, it is costly to check the consistency of all replicas, and it is difficult to manage non-deterministic computations [25]. Also, additional messages are needed to communicate with all replicas. Nevertheless, Ferreira et al. [25] showed that replication methods outperform traditional checkpointing methods if the MTBF is low.

4.2 ABFT

ABFT was first proposed by Huang and Abraham in 1984 [26] and includes algorithms able to automatically detect and correct errors. ABFT is not applicable to all algorithms, but researchers are continuously adapting new algorithms to be fault tolerant because of the generally low overhead compared to other methods. The core principle of ABFT is to introduce redundant information which can later be used to correct computations.

ABFT algorithms can be distinguished into online and offline methods [11]. Offline methods correct errors after the computation is finished, whereas online methods repair errors already during the computation. The most prominent examples of offline ABFT are matrix operations [27], where checksum columns are calculated which are later used to verify and correct the result. Online ABFT can also be applied to matrix operations [28] and is often more efficient than its offline counterpart, since faults are corrected in a timely manner, when their impact is still small, rather than only after termination.
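The checksum principle behind offline ABFT can be made concrete with a short C sketch: if A is extended by one extra row holding its column sums, then after C = A * B the extra row of the result must equal the column sums of C, and a mismatch reveals a corrupted column (matrix layout and tolerance are illustrative):

    #include <math.h>

    /* C holds (rows + 1) x cols entries in row-major order, where row
     * 'rows' is the checksum row produced by the augmented multiplication.
     * Returns the index of a corrupted column, or -1 if all checksums hold. */
    int abft_check(const double *C, int rows, int cols, double tol) {
        for (int j = 0; j < cols; j++) {
            double sum = 0.0;
            for (int i = 0; i < rows; i++)
                sum += C[i * cols + j];
            if (fabs(sum - C[rows * cols + j]) > tol)
                return j;  /* checksum mismatch in column j */
        }
        return -1;
    }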

The main drawback of ABFT is that it requires modifications of the application code. However, since most applications rely on low-level algebra libraries, the methods can be implemented directly in those libraries, hidden from the application programmer. Another disadvantage of ABFT is that it cannot compensate for all kinds of faults, e.g. complete node failures. Furthermore, if the number of faults during the computation exceeds the capabilities of the algorithm, it cannot correct them anymore. ABFT also introduces computational overhead, which is, however, comparatively small compared to other fault tolerance methods.

5 Fault Avoidance

In contrast to fault recovery and fault containment, fault avoidance or proactive fault tolerance methods aim at preventing faults before they occur. Generally, fault avoidance methods can be split into two phases: first, predicting the time, location, and kind of upcoming faults; second, taking preventive measures to avoid the fault and continue execution. Proactive fault tolerance is a comparatively new but promising field of research [1]. The prediction of fault occurrences mainly relies on techniques from data mining, statistics, and machine learning which use RAS log files and sensor data. The most common preventive measure is to migrate tasks running on components which are about to fail to healthy components.
5.1 Fault Prediction

It is important to consider which data and which model to use to accurately predict faults. Engelmann et al. [29] derived four different types of health monitoring data that can be used to predict faults, whereas most prediction methods rely on Type 4, which utilizes a historical database used as training data for machine learning methods. It mostly includes sensor data such as fan speed, CPU temperature, and SMART metrics, but also RAS logfiles containing failure histories.

The success of fault avoidance methods mainly depends on the accuracy of the predictions. It is important that predictors can predict both the location and the time of a fault in order to successfully migrate a task [30]. If too many faults are missed, the advantage of a prediction-based model is small compared to other approaches such as rollback-recovery. On the other hand, if the predictor produces too many false positives, too much time is spent on premature preventive measures. Models used for fault prediction mainly include machine learning models [31] and statistical models [32].
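As a toy illustration of a predictor over such sensor data, the following C sketch flags a node based on hand-picked thresholds; production predictors learn such decision rules from historical data instead [31, 32], and all values here are illustrative:

    /* Simplified Type-4 health monitoring snapshot of one node. */
    struct node_health {
        double fan_rpm;
        double cpu_temp_c;
        double temp_slope_c_per_min; /* trend from the recent sensor history */
    };

    /* Rule-based toy predictor: flag the node if cooling appears to be
     * degrading or the CPU is already critically hot. */
    int failure_predicted(const struct node_health *h) {
        if (h->fan_rpm < 1000.0 && h->temp_slope_c_per_min > 0.5)
            return 1; /* fan slowing down while temperature rises */
        if (h->cpu_temp_c > 90.0)
            return 1; /* critical temperature reached */
        return 0;
    }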

5.2 Migration

Once a fault is predicted, measures have to be taken to prevent it from happening. The most common approaches for prevention are virtualization-based migration and process migration, which both move tasks from failing nodes to healthy ones. Both approaches can be divided into stop&copy and live migration [33]. Stop&copy methods halt the execution for as long as it takes to copy the data to a target node, which depends on the memory footprint, whereas live migration methods continue execution during the copy process, reducing the overall downtime.

Both virtualization-based and process-based methods need available nodes for migration. These can either be designated spare nodes, which are expected to become a commodity in future HPC systems, unused nodes, or the node with the lowest load [33]. Two tasks sharing the same node can, however, result in imbalance and an overall worse performance. Most approaches generally use free available nodes first, then spare nodes, and as a last option the least busy node [33].

Nagarajan et al. [34] use Xen-virtualization-based migration, while their approach is generally transferable to other virtualization techniques as well. Each node runs a host virtual machine (VM) which is responsible for resource management and contains one or more VMs running the actual application. Xen supports live migration of VMs [35], which allows MPI applications to continue to execute during migration.

Process-based migration [36] has been widely studied on several operating systems. It generally follows the same steps as virtualization-based migration, but instead of copying the state of a VM it copies the process memory [33]. Process-based migration also supports live migration.
Applications do not need to be modified for either process-based or VM-based migration. Virtualization offers several advantages such as increased security, a customized operating system, live migration, and isolation. Furthermore, VMs can be started and stopped more easily than real machines [37]. Despite these advantages, virtualization-based techniques have not been widely adopted in the HPC community because of their larger overhead.
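At its core, the stop&copy scheme reduces to a blocking state transfer, sketched here in C/MPI with an illustrative tag and a flat state buffer; live migration would overlap most of this copy with continued execution:

    #include <mpi.h>

    #define MIG_TAG 42 /* illustrative message tag */

    /* Stop&copy on the failing node: execution halts for the entire
     * duration of the copy to the spare node. */
    void migrate_out(const double *state, int n, int spare, MPI_Comm comm) {
        MPI_Send(state, n, MPI_DOUBLE, spare, MIG_TAG, comm);
    }

    /* Counterpart on the spare node, which resumes the computation from
     * the received state. */
    void migrate_in(double *state, int n, int source, MPI_Comm comm) {
        MPI_Recv(state, n, MPI_DOUBLE, source, MIG_TAG, comm,
                 MPI_STATUS_IGNORE);
    }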
Fault avoidance cannot fully replace other approaches such as rollback-recovery, since not all faults are predictable. Nevertheless, fault avoidance can be combined with other approaches such as ABFT, replication, and rollback-recovery. Since it can significantly increase the MTBF, it allows an application to take fewer checkpoints, as shown by Aupy et al. [38], who study the impact of fault prediction and migration on checkpointing.
6 LAIK

LAIK (Leichtgewichtige Anwendungs-Integrierte Datenhaltungs Komponente) [39] is a library currently under development at the Technical University of Munich which provides lightweight index space management and load balancing for HPC applications, supporting application-integrated fault tolerance. Its main objectives are to hide the complexity of index space management and load balancing from application developers, increasing the reliability of HPC applications.

LAIK decouples data decomposition from the application code so that applications can adapt to a changing number of nodes, computational resources, and computational requirements. The developer needs to assign application data to data containers using index spaces. Whenever required, LAIK can trigger a partitioning algorithm which distributes the data across the nodes where the respective data container is needed. This allows higher flexibility in regard to fault tolerance and scheduling mechanisms. It also avoids application-specific code, which can become increasingly complex and hard to maintain. Furthermore, LAIK is designed to be portable by defining an interface which supports multiple communication backends and by allowing application developers to incrementally port their applications to LAIK. Figure 3 shows an overview of how LAIK is integrated in HPC systems.

Figure 3: Integration of LAIK in HPC systems [39]

The capabilities of LAIK can be utilized for application-integrated fault tolerance. It is possible to implement fault recovery, fault containment, and fault avoidance methods as layers on top of LAIK using its index space management and partitioning features.

LAIK can be used for local checkpointing by maintaining copies of relevant data structures on neighboring nodes [39]. Once a node failure is detected, the application can revert to the checkpointed state of these copies and resume execution from there on a spare node using LAIK's partitioning feature.
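The neighbor-copy idea can be illustrated with plain MPI; the sketch below shows the concept only and is not LAIK's actual API:

    #include <mpi.h>

    /* Each rank keeps a redundant copy of its data partition on the next
     * rank, so after a single node failure the lost partition can be
     * restored from the neighbor and repartitioned onto a spare node. */
    void buddy_checkpoint(const double *part, double *backup_of_left, int n,
                          MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int right = (rank + 1) % size;        /* holds my backup */
        int left  = (rank - 1 + size) % size; /* I hold its backup */

        /* Send my partition to the right neighbor while receiving the
         * left neighbor's partition as its backup. */
        MPI_Sendrecv(part, n, MPI_DOUBLE, right, 0,
                     backup_of_left, n, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }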
Furthermore, it is possible to implement replication in a similar way as checkpointing using LAIK. Again, multiple redundant copies of the data containers are maintained on different nodes, whereas one node acts as the master node. Instead of only using the redundant data after a failure, as in checkpointing, tasks can be run in parallel on each replicated data container. Additional synchronization mechanisms have to be utilized to keep the replicas in a consistent state. In case the master node experiences a failure, one of the replicas becomes the new master node and execution continues without interruption.

LAIK also allows proactive fault tolerance. Once a prediction algorithm predicts a failure, the nodes are synchronized, execution is stopped, and the failing node is excluded. Then, LAIK triggers the repartitioning algorithm and execution can be continued on a different node, preventing a complete application failure.

7 Conclusion

Fault tolerance methods aim at preventing failures by enabling applications to successfully terminate in the presence of faults. Several distinct fault tolerance methods have been developed to solve the resilience challenge. Since the most common method, global rollback-recovery, is not feasible for future systems, new methods have to be considered. However, currently no other method is able to fully replace rollback-recovery. Nevertheless, methods such as user-level fault tolerance, multi-level checkpointing, and diskless checkpointing aim at increasing the efficiency of checkpointing. Also, additional methods such as replication, failure prediction, and process migration can be used as a complement to rollback-recovery to increase the MTBF, and application-specific methods such as ABFT, forward recovery, and silent error detection can further improve fault tolerance.

It is challenging to use the various existing fault tolerance libraries in a complementary way, since they rely on different technologies. In order to reach exascale resilience, an integrated approach which is able to combine fault tolerance methods is of high importance. LAIK has the capabilities to act as a basis for such an approach. Researchers could build layers on top of LAIK, implementing new methods in a standardized and compatible way utilizing LAIK's features. Furthermore, additional partitioning algorithms, data structures, and communication backends could be implemented because of LAIK's modular architecture. This would enable a complementary use of multiple fault tolerance methods to significantly increase the overall fault tolerance of HPC systems.
References

[1] F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir, "Toward exascale resilience: 2014 update," Supercomputing Frontiers and Innovations, vol. 1, no. 1, pp. 5–28, 2014.

[2] J. Shalf, S. Dosanjh, and J. Morrison, "Exascale computing technology challenges," in International Conference on High Performance Computing for Computational Science, pp. 1–25, Springer, 2010.

[3] E. Strohmaier, "Top500 supercomputer," in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06, (New York, NY, USA), ACM, 2006.

[4] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, et al., "Exascale computing study: Technology challenges in achieving exascale systems," 2008.

[5] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11–33, 2004.

[6] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, et al., "Addressing failures in exascale computing," The International Journal of High Performance Computing Applications, vol. 28, no. 2, pp. 129–173, 2014.

[7] B. Schroeder and G. A. Gibson, "Understanding failures in petascale computers," in Journal of Physics: Conference Series, vol. 78, p. 012022, IOP Publishing, 2007.

[8] A. Oliner and J. Stearley, "What supercomputers say: A study of five system logs," in Dependable Systems and Networks, 2007. DSN'07. 37th Annual IEEE/IFIP International Conference on, pp. 575–584, IEEE, 2007.

[9] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Computing Surveys (CSUR), vol. 34, no. 3, pp. 375–408, 2002.

[10] G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, Y. Robert, F. Vivien, and D. Zaidouni, "Unified model for assessing checkpointing protocols at extreme-scale," Concurrency and Computation: Practice and Experience, vol. 26, no. 17, pp. 2772–2791, 2014.

[11] F. Cappello, "Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities," The International Journal of High Performance Computing Applications, vol. 23, no. 3, pp. 212–226, 2009.

[12] S. Rao, L. Alvisi, and H. M. Vin, "The cost of recovery in message logging protocols," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 2, pp. 160–173, 2000.

[13] L. Alvisi and K. Marzullo, "Message logging: Pessimistic, optimistic, causal, and optimal," IEEE Transactions on Software Engineering, vol. 24, no. 2, pp. 149–159, 1998.

[14] E. Roman, "A survey of checkpoint/restart implementations," Lawrence Berkeley National Laboratory, Tech. Rep., 2002.

[15] J. Dongarra, T. Herault, and Y. Robert, "Fault tolerance techniques for high-performance computing," in Fault-Tolerance Techniques for High-Performance Computing, pp. 3–85, Springer, 2015.

[16] J. S. Plank, K. Li, and M. A. Puening, "Diskless checkpointing," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 10, pp. 972–986, 1998.

[17] N. H. Vaidya, "A case for two-level distributed recovery schemes," in ACM SIGMETRICS Performance Evaluation Review, vol. 23, pp. 64–73, ACM, 1995.

[18] S. Tixeuil, "Self-stabilizing algorithms," in Algorithms and Theory of Computation Handbook, pp. 26–26, Chapman & Hall/CRC, 2010.

[19] A. Vishnu, H. van Dam, N. R. Tallent, D. J. Kerbyson, and A. Hoisie, "Fault modeling of extreme scale applications using machine learning," in Parallel and Distributed Processing Symposium, 2016 IEEE International, pp. 222–231, IEEE, 2016.

[20] L. A. B. Gomez and F. Cappello, "Detecting and correcting data corruption in stencil applications through multivariate interpolation," in Cluster Computing (CLUSTER), 2015 IEEE International Conference on, pp. 595–602, IEEE, 2015.

[21] E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, "Lightweight silent data corruption detection based on runtime data analysis for HPC applications," in Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, pp. 275–278, ACM, 2015.

[22] C. Engelmann and S. Böhm, "Redundant execution of HPC applications with MR-MPI," in Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp. 15–17, 2011.

[23] D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and R. Brightwell, "Detection and correction of silent data corruption for large-scale high-performance computing," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 78, IEEE Computer Society Press, 2012.

[24] A. R. Benson, S. Schmit, and R. Schreiber, "Silent error detection in numerical time-stepping schemes," The International Journal of High Performance Computing Applications, vol. 29, no. 4, pp. 403–421, 2015.

[25] K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. G. Bridges, and D. Arnold, "Evaluating the viability of process replication reliability for exascale systems," in High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, pp. 1–12, IEEE, 2011.

[26] K.-H. Huang et al., "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, vol. 100, no. 6, pp. 518–528, 1984.

[27] Z. Chen and J. Dongarra, "Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources," in Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, 10 pp., IEEE, 2006.

[28] P. Wu, C. Ding, L. Chen, F. Gao, T. Davies, C. Karlsson, and Z. Chen, "Fault tolerant matrix-matrix multiplication: correcting soft errors on-line," in Proceedings of the Second Workshop on Scalable Algorithms for Large-Scale Systems, pp. 25–28, ACM, 2011.

[29] C. Engelmann, G. R. Vallee, T. Naughton, and S. L. Scott, "Proactive fault tolerance using preemptive migration," in Parallel, Distributed and Network-based Processing, 2009 17th Euromicro International Conference on, pp. 252–257, IEEE, 2009.

[30] A. Gainaru, F. Cappello, M. Snir, and W. Kramer, "Fault prediction under the microscope: A closer look into HPC systems," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 77, IEEE Computer Society Press, 2012.

[31] E. W. Fulp, G. A. Fink, and J. N. Haack, "Predicting computer system failures using support vector machines," WASL, vol. 8, pp. 5–5, 2008.

[32] L. Yu, Z. Zheng, Z. Lan, and S. Coghlan, "Practical online failure prediction for Blue Gene/P: Period-based vs event-driven," in Dependable Systems and Networks Workshops (DSN-W), 2011 IEEE/IFIP 41st International Conference on, pp. 259–264, IEEE, 2011.

[33] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott, "Proactive process-level live migration and back migration in HPC environments," Journal of Parallel and Distributed Computing, vol. 72, no. 2, pp. 254–267, 2012.

[34] A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott, "Proactive fault tolerance for HPC with Xen virtualization," in Proceedings of the 21st Annual International Conference on Supercomputing, pp. 23–32, ACM, 2007.

[35] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live migration of virtual machines," in Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2, pp. 273–286, USENIX Association, 2005.

[36] D. S. Milojičić, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou, "Process migration," ACM Computing Surveys (CSUR), vol. 32, no. 3, pp. 241–299, 2000.

[37] W. Huang, J. Liu, B. Abali, and D. K. Panda, "A case for high performance computing with virtual machines," in Proceedings of the 20th Annual International Conference on Supercomputing, pp. 125–134, ACM, 2006.

[38] G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, "Checkpointing algorithms and fault prediction," Journal of Parallel and Distributed Computing, vol. 74, no. 2, pp. 2048–2064, 2014.

[39] J. Weidendorfer, D. Yang, and C. Trinitis, "LAIK: A library for fault tolerant distribution of global data for parallel applications," in Konferenzband des PARS'17 Workshops, p. 10, 2017.