Toward Exascale Resilience: 2014 update

Franck Cappello; Al Geist; William Gropp; Sanjay Kale; Bill Kramer; Marc Snir

doi:10.14529/jsfi140101

Authors

Franck Cappello, Argonne National Laboratory, Chicago
Al Geist, Oak Ridge National Laboratory, Oak Ridge
William Gropp, University of Illinois at Urbana-Champaign, Champaign
Sanjay Kale, University of Illinois at Urbana-Champaign, Champaign
Bill Kramer, University of Illinois at Urbana-Champaign, Champaign
Marc Snir, Argonne National Laboratory, Chicago

Abstract

Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict that errors will occur in exascale systems many times per day. These errors will propagate and generate various kinds of malfunctions, from simple process crashes to corrupted results.

The past five years have seen extraordinary technical progress in many domains related to exascale resilience. Several technical options, initially considered inapplicable or unrealistic in the HPC context, have demonstrated surprising successes. Despite this progress, the exascale resilience problem is not solved, and the community is still facing the difficult challenge of ensuring that exascale applications complete and generate correct results while running on unstable systems. Since 2009, many workshops, studies, and reports have improved the definition of the resilience problem and provided refined recommendations. Some projections made during the previous decades and some priorities established from these projections need to be revised. This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.
