Administration, Monitoring and Analysis of Supercomputers in Russia: a Survey of 10 HPC Centers
DOI:
https://doi.org/10.14529/jsfi210305
Keywords:
supercomputer, high-performance computing, administration, survey, monitoring, performance
Abstract
Supercomputer technologies are in demand for solving many important and computationally intensive tasks in various fields of science and technology, so it is not surprising that there are several dozen supercomputer centers in Russia alone. However, the goals behind creating such centers, as well as the range of tasks solved in them, can vary greatly, so the structure of the supercomputers and the policies for their usage can differ significantly. As a result, many supercomputer centers operate in isolation: their administrators tend to solve administration-related tasks on their own, even though solutions for many similar tasks have already been developed and applied in other centers. This can happen for different reasons, but in any case the situation could and should be improved. To do so, it is worth establishing closer connections between supercomputer centers, which would allow them to exchange experience more actively or to jointly develop the system software they need. To understand the current situation in this area, a survey of representatives of 10 large supercomputer centers in Russia was conducted, and its results are presented in this paper. Two relevant topics, the use of monitoring data in practice and real-life examples of improving supercomputer functioning, are also discussed in more detail; the system administrators of HSE University, Skoltech and Moscow State University share their views on these topics.
License
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.