How to Assess the Quality of Supercomputer Resource Usage
DOI:
https://doi.org/10.14529/jsfi220301
Keywords:
supercomputing, high-performance computing, performance analysis, monitoring, workload analysis, resource utilization, resource provisioning
Abstract
A supercomputer is an exceptionally valuable computational resource, so it must be used as efficiently as possible. In practice, however, the efficiency of its usage leaves much to be desired. There are various reasons for this. One of the main ones is the low performance of user applications, and users themselves are often unaware of performance issues in their programs. Supercomputer administrators therefore need to be able to constantly monitor the performance and behavior of all running jobs. The problem is that the commonly used metrics for assessing the quality of resource consumption (such as CPU or GPU load, the number of bytes transferred over the MPI network, etc.) are often neither convenient nor accurate. This paper describes the implementation and evaluation of a previously proposed assessment system that, in our opinion, significantly eases the task of properly evaluating the quality of supercomputer resource usage. We also touch upon a related topic in assessing the quality of HPC resource usage: the organization of HPC resource provisioning.
License
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-Non Commercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.