Performance and Power Analysis of a Vector Computing System

Kazuhiko Komatsu; Akito Onodera; Erich Focht; Soya Fujimoto; Yoko Isobe; Shintaro Momose; Masayuki Sato; Hiroaki Kobayashi

doi:10.14529/jsfi210205

Authors

Kazuhiko Komatsu Tohoku University
Akito Onodera Tohoku University
Erich Focht NEC Deutschland GmbH
Soya Fujimoto NEC Corporation
Yoko Isobe NEC Corporation
Shintaro Momose NEC Corporation
Masayuki Sato Tohoku University
Hiroaki Kobayashi Tohoku University

Abstract

The performance of recent computing systems has improved drastically through increases in the number of cores. However, this approach is reaching its limits because of the power constraints of facilities. This paper instead focuses on vector processing with a long vector length, which has the potential to deliver both high performance and high power efficiency. It examines this potential through the optimization of two benchmarks, Himeno and HPCG, for the latest vector computing system, SX-Aurora TSUBASA. The SX-Aurora TSUBASA architecture owes its high efficiency to making good use of its long vector length. With these characteristics in mind, the paper examines the various levels of optimization required for a large-scale vector computing system, such as vectorization, loop unrolling, use of cache, domain decomposition, process mapping, and problem size tuning. The evaluation and analysis suggest that these optimizations improve the sustained performance, power efficiency, and scalability of both benchmarks. The results therefore indicate that the SX-Aurora TSUBASA architecture can achieve higher power efficiency owing to its high sustained memory bandwidth paired with long-vector computing.
