Online MPI Process Mapping for Coordinating Locality and Memory Congestion on NUMA Systems
DOI: https://doi.org/10.14529/jsfi200104

Abstract
Mapping MPI processes to processor cores, called process mapping, is crucial to achieving scalable performance on multi-core processors. By analyzing the communication behavior among MPI processes, process mapping can improve communication locality and thus reduce the overall communication cost. However, on modern non-uniform memory access (NUMA) systems, memory congestion can degrade performance more severely than poor locality, because heavy contention on shared caches and memory controllers causes long access latencies. Most existing work focuses only on improving locality or relies on offline profiling to analyze the communication behavior.
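As a concrete illustration of what process mapping means in practice, the minimal C sketch below binds each MPI rank to a core with sched_setaffinity(2), the Linux primitive that core bindings ultimately rely on. The rank-modulo-core-count policy is only a placeholder assumption for the sketch, not a locality- or congestion-aware mapping.

/* Sketch: bind the calling MPI process to a core chosen by a placeholder policy. */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Placeholder policy (assumption): rank modulo the number of online cores. */
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    int target = rank % (int)ncores;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(target, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
    else
        printf("rank %d bound to core %d\n", rank, target);

    MPI_Finalize();
    return 0;
}

A communication-aware mapper replaces the placeholder policy with one derived from the observed communication pattern and the machine topology.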
We propose a process mapping method that dynamically adapts the mapping to the observed communication behavior while coordinating locality and memory congestion. Our method works online during the execution of an MPI application. It requires no modification of the application, no prior knowledge of the communication behavior, and no changes to the hardware or operating system. Experimental results show that our method achieves performance and energy efficiency close to those of the best static mapping, with low overhead on application execution. In experiments with the NAS Parallel Benchmarks on a NUMA system, the improvements in performance and total energy consumption are up to 34% (18.5% on average) and 28.9% (13.6% on average), respectively. In experiments with two GROMACS applications on a larger NUMA system, the average improvements in performance and total energy consumption are 21.6% and 12.6%, respectively.
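To make the coordination of locality and memory congestion more tangible, the self-contained C sketch below (not the algorithm proposed in the paper) greedily co-locates the most heavily communicating rank pairs on the same NUMA node while capping the number of ranks per node as a crude proxy for limiting congestion. The communication matrix and node layout are assumed purely for illustration.

/* Sketch: greedy pairwise co-location under a per-node rank cap. */
#include <stdio.h>

#define NRANKS 8
#define NNODES 2
#define CAP    (NRANKS / NNODES)   /* per-node rank cap (congestion proxy) */

int main(void)
{
    /* comm[i][j]: assumed message volume between ranks i and j. */
    int comm[NRANKS][NRANKS] = {{0}};
    comm[0][1] = comm[1][0] = 100;
    comm[2][3] = comm[3][2] = 90;
    comm[4][5] = comm[5][4] = 80;
    comm[6][7] = comm[7][6] = 70;

    int node_of[NRANKS];
    int load[NNODES] = {0};
    for (int r = 0; r < NRANKS; r++) node_of[r] = -1;

    /* Repeatedly pick the heaviest unplaced pair and co-locate it on the
     * least-loaded node, as long as the cap leaves room for both ranks. */
    for (;;) {
        int bi = -1, bj = -1, best = 0;
        for (int i = 0; i < NRANKS; i++)
            for (int j = i + 1; j < NRANKS; j++)
                if (node_of[i] < 0 && node_of[j] < 0 && comm[i][j] > best) {
                    best = comm[i][j]; bi = i; bj = j;
                }
        if (bi < 0) break;                   /* no unplaced pair left */

        int target = 0;
        for (int n = 1; n < NNODES; n++)
            if (load[n] < load[target]) target = n;
        if (load[target] + 2 > CAP) break;   /* cap reached: stop co-locating */

        node_of[bi] = node_of[bj] = target;
        load[target] += 2;
    }

    /* Any rank left unplaced is spread across the least-loaded nodes. */
    for (int r = 0; r < NRANKS; r++) {
        if (node_of[r] >= 0) continue;
        int target = 0;
        for (int n = 1; n < NNODES; n++)
            if (load[n] < load[target]) target = n;
        node_of[r] = target;
        load[target]++;
    }

    for (int r = 0; r < NRANKS; r++)
        printf("rank %d -> NUMA node %d\n", r, node_of[r]);
    return 0;
}

In an online setting, the communication matrix would be refreshed periodically from runtime measurements and the mapping reapplied when the pattern changes, which this static sketch does not attempt.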