Supercomputer Lomonosov-2: Large Scale, Deep Monitoring and Fine Analytics for the User Community
DOI: https://doi.org/10.14529/jsfi190201
Abstract
The huge number of hardware and software components, together with the large number of parameters affecting the performance of each parallel application, makes ensuring the efficiency of a large-scale supercomputer extremely difficult. In this situation, all basic parameters of the supercomputer should be monitored constantly, and many decisions about its functioning should be made automatically by special software. In this paper we describe the tight connection between the complexity of modern large high-performance computing systems and the special techniques and tools required to ensure their efficiency in practice. The main subsystems of the developed complex (Octoshell, DiMMoN, Octotron, JobDigest, and an expert software system that brings fine analytics on parallel applications and the entire supercomputer to users and sysadmins) are in active operation on the large supercomputer systems at Lomonosov Moscow State University. A brief description of the architecture of the Lomonosov-2 supercomputer is presented, and we discuss questions that show both the wide variety of emerging complex issues and the need for an integrated approach to effectively supporting large supercomputer systems.
License
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 3.0 License, which allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.