Exascale Machines Require New Programming Paradigms and Runtimes
DOI: https://doi.org/10.14529/jsfi150201

Abstract
Extreme-scale parallel computing systems will have tens of thousands of optionally accelerator-equipped nodes with hundreds of cores each, as well as deep memory hierarchies and complex interconnect topologies. Such Exascale systems will provide hardware parallelism at multiple levels and will be energy constrained. Their extreme scale and the rapidly deteriorating reliability of their hardware components mean that Exascale systems will exhibit low mean-time-between-failure values. Furthermore, existing programming models already require heroic programming and optimisation efforts to achieve high efficiency on current supercomputers. Invariably, these efforts are platform-specific and non-portable. In this paper we explore the shortcomings of existing programming models and runtime systems for large-scale computing systems. We then propose and discuss important features of programming paradigms and runtime systems for large-scale computing systems, with a special focus on data-intensive applications and resilience.
Finally, we also discuss code sustainability issues and propose several software metrics that are of paramount importance for code development for large-scale computing systems.