Administration, Monitoring and Analysis of Supercomputers in Russia: a Survey of 10 HPC Centers

Authors

  • Vadim V. Voevodin, M.V. Lomonosov Moscow State University
  • Roman A. Chulkevich, HSE University
  • Pavel S. Kostenetskiy, HSE University
  • Vyacheslav I. Kozyrev, HSE University
  • Anton K. Maliutin, Skolkovo Institute of Science and Technology (Skoltech)
  • Dmitry A. Nikitenko, M.V. Lomonosov Moscow State University
  • Sergey G. Rykovanov, Skolkovo Institute of Science and Technology (Skoltech)
  • Artemiy B. Shamsutdinov, HSE University
  • Yurii N. Shkandybin, Skolkovo Institute of Science and Technology (Skoltech)
  • Sergey A. Zhumatiy, M.V. Lomonosov Moscow State University

DOI:

https://doi.org/10.14529/jsfi210305

Keywords:

supercomputer, high-performance computing, administration, survey, monitoring, performance

Abstract

Supercomputer technologies are in demand for solving many important, computationally intensive tasks in various fields of science and technology, so it is not surprising that there are several dozen supercomputer centers in Russia alone. However, the goals behind creating such centers, as well as the range of tasks solved in them, vary greatly, so the structure of the supercomputers and the policies for their usage can differ significantly. As a result, many supercomputer centers operate in isolation: their administrators tend to solve administration-related tasks on their own, even though solutions for many similar tasks have already been developed and applied in other centers. This can happen for different reasons, but in any case the situation could and should be improved. To do this, it is worth establishing closer connections between supercomputer centers, which would allow them to exchange experience more actively or to jointly develop the system software they need. To understand the current situation in this area, a survey of representatives of 10 large supercomputer centers in Russia was conducted, and its results are presented in this paper. Two relevant topics, the use of monitoring data in practice and real-life examples of improving supercomputer functioning, are also discussed in more detail; system administrators from HSE University, Skoltech and Moscow State University share their views on these topics.


Published

2021-10-20

How to Cite

Voevodin, V. V., Chulkevich, R. A., Kostenetskiy, P. S., Kozyrev, V. I., Maliutin, A. K., Nikitenko, D. A., Rykovanov, S. G., Shamsutdinov, A. B., Shkandybin, Y. N., & Zhumatiy, S. A. (2021). Administration, Monitoring and Analysis of Supercomputers in Russia: a Survey of 10 HPC Centers. Supercomputing Frontiers and Innovations, 8(3), 82–103. https://doi.org/10.14529/jsfi210305
