Effects of Using a Memory Stalled Core for Handling MPI Communication Overlapping in the SOR Solver on SX-ACE and SX-Aurora TSUBASA

Takashi Soga; Kenta Yamaguchi; Raghunandan Mathur; Osamu Watanabe; Akihiro Musa; Ryusuke Egawa; Hiroaki Kobayashi

doi:10.14529/jsfi200401

Authors

Takashi Soga, NEC Solution Innovators, Ltd.
Kenta Yamaguchi, NEC Solution Innovators, Ltd.
Raghunandan Mathur, NEC Corporation
Osamu Watanabe, NEC Corporation
Akihiro Musa, NEC Corporation
Ryusuke Egawa, Tohoku University
Hiroaki Kobayashi, Tohoku University

Abstract

Modern high-performance computing (HPC) systems consist of a large number of nodes featuring multi-core processors. Many computational fluid dynamics (CFD) codes use the Message Passing Interface (MPI) to exploit the potential of such systems. In general, MPI communication costs increase as the number of MPI processes increases. In this paper, we discuss the performance of a code in which one core is used as a dedicated communication core when that core cannot contribute to performance improvement because of memory-bandwidth limitations. With the dedicated communication core, communication operations are overlapped with computation operations, which enables highly efficient computation that exploits both the limited memory bandwidth and the otherwise idle core. The performance evaluation shows that this code can hide 90% of the MPI communication time on the SX-ACE supercomputer and 80% on the SX-Aurora TSUBASA supercomputer, and that the performance of the successive over-relaxation (SOR) method improves by 32% on SX-ACE and by 20% on SX-Aurora TSUBASA.
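The overlap idea described in the abstract can be illustrated with a small sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: a 1D Poisson problem, a red-black SOR sweep, a Python thread standing in for the dedicated communication core, and a hypothetical `exchange_halo` function standing in for the MPI halo exchange. Python threads do not run in parallel the way cores do; the sketch only shows the ordering of operations (start communication, update points that do not need ghost cells, wait, update the remaining boundary points) and checks that this ordering gives the same result as a non-overlapped sweep.

```python
import threading

OMEGA = 1.5  # relaxation factor; a hypothetical choice for this sketch


def sor_update(u, f, i, h2):
    """In-place SOR update of point i for the 1D problem -u'' = f."""
    gauss_seidel = 0.5 * (u[i - 1] + u[i + 1] + h2 * f[i])
    u[i] += OMEGA * (gauss_seidel - u[i])


def exchange_halo(u, left_value, right_value):
    """Hypothetical stand-in for an MPI halo exchange: fill both ghost cells."""
    u[0] = left_value
    u[-1] = right_value


def sweep_plain(u, f, h2, color, left_value, right_value):
    """Reference sweep: communicate first, then update every point of one color."""
    n = len(u) - 2  # owned points are u[1] .. u[n]
    exchange_halo(u, left_value, right_value)
    for i in range(1, n + 1):
        if i % 2 == color:
            sor_update(u, f, i, h2)


def sweep_overlapped(u, f, h2, color, left_value, right_value):
    """Overlapped sweep: a separate thread communicates while interior points update."""
    n = len(u) - 2  # requires n >= 2
    comm = threading.Thread(target=exchange_halo, args=(u, left_value, right_value))
    comm.start()
    # Points 2 .. n-1 never read a ghost cell, so they can proceed immediately.
    for i in range(2, n):
        if i % 2 == color:
            sor_update(u, f, i, h2)
    comm.join()  # ghost cells are now valid
    # Points 1 and n read the ghost cells, so they are updated last.
    for i in (1, n):
        if i % 2 == color:
            sor_update(u, f, i, h2)


def solve(n, iterations, overlapped, left_value=0.0, right_value=1.0):
    """Run red-black SOR on n owned points with zero right-hand side."""
    h2 = (1.0 / (n + 1)) ** 2
    u = [0.0] * (n + 2)
    f = [0.0] * (n + 2)
    sweep = sweep_overlapped if overlapped else sweep_plain
    for _ in range(iterations):
        for color in (1, 0):
            sweep(u, f, h2, color, left_value, right_value)
    return u


if __name__ == "__main__":
    plain = solve(8, 200, overlapped=False)
    overlap = solve(8, 200, overlapped=True)
    print("identical results:", plain == overlap)
```

Within one color, each update reads only neighbors of the other color, so reordering the points of that color does not change the result. That property is what lets the boundary points wait for communication while the interior points proceed.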
