A Survey: Runtime Software Systems for High Performance Computing

Thomas Sterling; Matthew Anderson; Maciej Brodowicz

doi:10.14529/jsfi170103

Authors

Thomas Sterling, Center for Research in Extreme Scale Technologies, Indiana University, Bloomington
Matthew Anderson, Center for Research in Extreme Scale Technologies, Indiana University, Bloomington
Maciej Brodowicz, Center for Research in Extreme Scale Technologies, Indiana University, Bloomington

Abstract

HPC system design and operation are challenged by the critical requirements for significant advances in efficiency, scalability, user productivity, and performance portability, even at the end of Moore's Law with approaching nano-scale semiconductor technology. Conventional practices employ distributed memory message passing programming interfaces, sometimes combined with second-level thread-based interfaces for intra-node shared memory such as OpenMP, or with means of controlling heterogeneous components such as OpenCL for GPUs. While these methods include some modest runtime control, they are principally coarse grained and statically scheduled. Yet many real-world applications achieve efficiencies of less than 10%, although some benchmarks may achieve 80% efficiency or better (e.g., HPL). To address these challenges, strategies employing runtime software systems are being pursued to exploit information about the status of the application and the system hardware operation throughout the execution for purposes of introspection to guide the task scheduling and resource management in support of dynamic adaptive control. Runtime systems provide adaptive means to reduce the effects of starvation, latency, overhead, and contention. While each is unique in its details, many share common properties such as multi-tasking, either preemptive or non-preemptive; message-driven computation such as active messages; sophisticated fine-grain synchronization such as dataflow and futures constructs; global name or address spaces; and control policies for optimizing task scheduling, in part to address the uncertainty of asynchrony. This survey will identify key parameters and properties of modern and sometimes experimental runtime systems actively employed today and provide a detailed description, summary, and comparison within a shared space of dimensions. It is not the intent of this paper to determine which is better or worse but rather to provide sufficient detail to permit the reader to select among them according to individual need.
