A Review of Supercomputer Performance Monitoring Systems

Konstantin S. Stefanov; Sucheta Pawar; Ashish Ranjan; Sanjay Wandhekar; Vladimir V. Voevodin

doi:10.14529/jsfi210304

Authors

Konstantin S. Stefanov, Lomonosov Moscow State University
Sucheta Pawar, HPC-Tech Group, Centre for Development of Advanced Computing, Pune, India
Ashish Ranjan, HPC-Tech Group, Centre for Development of Advanced Computing, Pune, India
Sanjay Wandhekar, HPC-Tech Group, Centre for Development of Advanced Computing, Pune, India
Vladimir V. Voevodin, Lomonosov Moscow State University

Keywords:

monitoring, supercomputers, performance monitoring, review

Abstract

High Performance Computing (HPC) is now one of the emerging fields in computer science and its applications. Top HPC facilities, supercomputers, offer great opportunities for modeling diverse processes, making it possible to create more and better products without full-scale experiments. Current supercomputers and their applications are very complex and are therefore hard to use efficiently. Performance monitoring systems are tools that help in understanding the efficiency of supercomputing applications and of overall supercomputer functioning. These systems collect data on what happens on a supercomputer (performance data, performance metrics) and present it in a way that allows conclusions to be drawn about performance issues in programs running on the supercomputer. In this paper we give an overview of existing performance monitoring systems designed for or used on supercomputers. We compare the performance monitoring systems found in the literature, describe problems that emerge in monitoring large-scale HPC systems, and outline our vision of the future direction of HPC monitoring system development.
