Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Authors

  • Saurabh Hukerikar, Oak Ridge National Laboratory
  • Christian Engelmann, Oak Ridge National Laboratory

DOI: https://doi.org/10.14529/jsfi170301

Abstract

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest that very high fault rates will be prevalent in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, with outcomes ranging from corrupted results to catastrophic application crashes. Therefore, the resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. Also, due to practical limits on power consumption in HPC systems, future systems are likely to embrace innovative architectures, increasing the level of hardware and software complexity. As a result, techniques that seek to improve resilience must navigate the complex trade-off space between resilience and its overheads in power consumption and performance. While the HPC community has developed various resilience solutions, including application-level techniques as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance and power efficiency across the system stack. Additionally, few of the current approaches are portable to the newer architectures and software environments that will be deployed on future systems.

In this paper, we develop a structured approach to the management of HPC resilience using the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. Each established solution is described in the form of a pattern that addresses concrete problems in the design of resilient systems. The complete catalog of resilience design patterns provides designers with reusable design elements. We also define a framework that enhances a designer's understanding of the important constraints and opportunities for the design patterns to be implemented and deployed at various layers of the system stack. This design framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also supports optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner despite frequent faults, errors, and failures of various types.

Published

2017-10-19

How to Cite

Hukerikar, S., & Engelmann, C. (2017). Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale. Supercomputing Frontiers and Innovations, 4(3), 4–42. https://doi.org/10.14529/jsfi170301