A Review of Supercomputer Performance Monitoring Systems
DOI:
https://doi.org/10.14529/jsfi210304
Keywords:
monitoring, supercomputers, performance monitoring, review
Abstract
High Performance Computing (HPC) is now one of the emerging fields in computer science and its applications. Supercomputers, the top HPC facilities, offer great opportunities for modeling diverse processes, making it possible to create more and better products without full-scale experiments. Current supercomputers and their applications are very complex and are therefore hard to use efficiently. Performance monitoring systems are tools that help in understanding the efficiency of supercomputing applications and the overall functioning of a supercomputer. These systems collect data on what happens on a supercomputer (performance data, performance metrics) and present it in a way that allows conclusions to be drawn about performance issues in the programs running on it. In this paper we give an overview of existing performance monitoring systems designed for or used on supercomputers. We compare the performance monitoring systems found in the literature, describe the problems that emerge in monitoring large-scale HPC systems, and outline our vision of the future direction of HPC monitoring system development.
License
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-Non Commercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.