Distributed Graph Algorithms for Multiple Vector Engines of NEC SX-Aurora TSUBASA Systems

Ilya V. Afanasyev; Vadim V. Voevodin; Kazuhiko Komatsu; Hiroaki Kobayashi

doi:10.14529/jsfi210206

Authors

Ilya V. Afanasyev Lomonosov Moscow State University
Vadim V. Voevodin Lomonosov Moscow State University
Kazuhiko Komatsu Tohoku University
Hiroaki Kobayashi Tohoku University

DOI:

https://doi.org/10.14529/jsfi210206

Abstract

This paper describes the world-first attempt to develop distributed graph algorithm implementations, aimed for modern NEC SX-Aurora TSUBASA vector systems. Such systems are equipped with up to eight powerful vector engines, which are capable to significantly accelerate graph processsing and simultaneously increase the scale of processed input graphs. This paper describes distributed implementations of three widely-used graph algorithms: Page Rank (PR), Bellman-Ford Single Source Shortest Paths (further referred as SSSP) and Hyperlink-Induced Topic Search (HITS), evaluating their performance and scalability on Aurora 8 system. In this paper we describe graph partitioning strategies, communication strategies, programming models and single-VE optimizations used in these implementations. The developed implementations achieve 40, 6.6 and 1.3 GTEPS performance on PR, SSSP and HITS algorithm on 8 vector engines, at the same time achieving up to 1.5x, 2x and 2.5x acceleration on 2, 4 and 8 vector engines of Aurora 8 systems. Finally, this paper describes an approach to incorporate distributed graph processing support into our previously developed Vector Graph Library (VGL) framework – a novel framework for graph analytics on NEC SX-Aurora TSUBASA architecture.

References

NEC SX-Aurora TSUBASA A300-8. https://www.nec.com/en/global/solutions/hpc/sx/A300-8.html, accessed: 2021-04-06

NVGRAPH. https://developer.nvidia.com/nvgraph, accessed: 2021-06-28

Afanasyev, I.V.: Developing an architecture-independent graph framework for modern vector processors and NVIDIA GPUs. Supercomputing Frontiers and Innovations 7(4), 49–61 (2021). https://doi.org/10.14529/jsfi200404

Afanasyev, I.V., Voevodin, V.V., Komatsu, K., Kobayashi, H.: VGL: a high-performance graph processing framework for the NEC SX-Aurora TSUBASA vector architecture. The Journal of Supercomputing 77(8), 8694–8715 (2021). https://doi.org/10.1007/s11227-020-03564-9

Afanasyev, I.V., Voevodin, V.V., Komatsu, K., Kobayashi, H., et al.: Developing efficient implementations of Bellman–Ford and Forward-Backward graph algorithms for NEC SXACE. Supercomputing Frontiers and Innovations 5(3), 65–69 (2018). https://doi.org/10.14529/jsfi180311

Afanasyev, I.V., Voevodin, V.V., Komatsu, K., Kobayashi, H., et al.: Analysis of relationship between SIMD-processing features used in NVIDIA GPUs and NEC SX-Aurora TSUBASA vector processors. In: International Conference on Parallel Computing Technologies. Lecture Notes in Computer Science, vol. 11657, pp. 125–139. Springer (2019). https://doi.org/10.1007/978-3-030-25636-4_10

Afanasyev, I.V., Voevodin, V.V., Komatsu, K., Kobayashi, H., et al.: Developing efficient implementations of shortest paths and page rank algorithms for NEC SX-Aurora TSUBASA architecture. Lobachevskii Journal of Mathematics 40(11), 1753–1762 (2019). https://doi.org/10.1134/S1995080219110039

Beamer, S., Asanović, K., Patterson, D.: Direction-optimizing breadth-first search. Scientific Programming 21(3-4), 137–148 (2013). https://doi.org/10.1109/SC.2012.50

Besta, M., Podstawski, M., Groner, L., et al.: To push or to pull: On reducing communication and synchronization in graph computations. In: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing. pp. 93–104. ACM (2017). https://doi.org/10.1145/3078597.3078616

Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: A recursive model for graph mining. In: Proceedings of the 2004 SIAM International Conference on Data Mining. pp. 442–446. SIAM (2004). https://doi.org/10.1137/1.9781611972740.43

Egawa, R., Komatsu, K., Isobe, Y., et al.: Performance and power analysis of SX-ACE using HP-X benchmark programs. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER). pp. 693–700. IEEE Computer Society (2017). https://doi.org/10.1109/CLUSTER.2017.65

Egawa, R., Komatsu, K., Kobayashi, H.: Designing an HPC refactoring catalog toward the exa-scale computing era. In: Resch, M.M., Bez, W., Focht, E., Kobayashi, H., Patel, N. (eds.) Sustained Simulation Performance 2014. pp. 91–98. Springer (2015). https://doi.org/10.1007/978-3-319-10626-7_8

Egawa, R., Komatsu, K., Kobayashi, H., et al.: Potential of a modern vector supercomputer for practical applications: performance evaluation of SX-ACE. The Journal of Supercomputing 73(9), 3948–3976 (2017). https://doi.org/10.1007/s11227-017-1993-y

Egawa, R., Komatsu, K., Takizawa, H.: Designing an open database of system-aware code optimizations. In: 2017 Fifth International Symposium on Computing and Networking (CANDAR). pp. 369–374. IEEE Computer Society (2017). https://doi.org/10.1109/CANDAR.2017.102

Gharaibeh, A., Reza, T., Santos-Neto, E., et al.: Efficient large-scale graph processing on hybrid CPU and GPU systems. arXiv preprint arXiv:1312.3018 (2013)

Goldberg, A., Radzik, T.: A heuristic improvement of the Bellman-Ford algorithm. Tech. rep., Stanford Univ CA Dept of Computer Science (1993)

Gómez, C., Casas, M., Mantovani, F., Focht, E.: Optimizing sparse matrix-vector multiplication in NEC SX-Aurora Vector Engine. Tech. rep., Technical Report, Barcelona Supercomputing Center (2020)

Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I.: Graphx: Graph processing in a distributed dataflow framework. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). pp. 599–613 (2014)

Gustafson, J.L.: Reevaluating Amdahl’s law. Communications of the ACM 31(5), 532–533 (1988). https://doi.org/10.1145/42411.42415

Han, M., Daudjee, K.: Giraph unchained: Barrierless asynchronous parallel execution in pregel-like graph processing systems. Proceedings of the VLDB Endowment 8(9), 950–961 (2015). https://doi.org/10.14778/2777598.2777604

Karypis, G., Kumar, V.: Metis: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices (1997), https://hdl.handle.net/11299/215346

Kleinberg, J.M., Kumar, R., Raghavan, P., et al.: The web as a graph: Measurements, models, and methods. In: International Computing and Combinatorics Conference. Lecture Notes in Computer Science, vol. 1627, pp. 1–17. Springer (1999). https://doi.org/10.1007/3-540-48686-0_1

Komatsu, K., Egawa, R., Hirasawa, S., Takizawa, H., Itakura, K., Kobayashi, H.: Migration of an atmospheric simulation code to an OpenACC platform using the Xevolver framework. In: 2015 Third International Symposium on Computing and Networking (CANDAR). pp. 515–520. IEEE Computer Society (2015). https://doi.org/10.1109/CANDAR.2015.102

Komatsu, K., Egawa, R., Hirasawa, S., Takizawa, H., Itakura, K., Kobayashi, H.: Translation of large-scale simulation codes for an OpenACC platform using the Xevolver framework. International Journal of Networking and Computing 6(2), 167–180 (2016). https://doi.org/10.15803/ijnc.6.2_167

Komatsu, K., Egawa, R., Isobe, Y., et al.: An approach to the highest efficiency of the HPCG benchmark on the SX-ACE supercomputer. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC15), Poster. pp. 1–2 (2015)

Komatsu, K., Watanabe, O., Musa, A., et al.: Performance evaluation of a vector supercomputer SX-Aurora TSUBASA. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, USA, Nov. 11-16, 2018. pp. 54:1–54:12. SC ’18, IEEE (2018). https://doi.org/10.1109/SC.2018.00057

Liu, H., Huang, H.H.: Enterprise: breadth-first graph traversal on GPUs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1–12. ACM (2015). https://doi.org/10.1145/2807591.2807594

Malewicz, G., Austern, M.H., Bik, A.J., et al.: Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. pp. 135–146. ACM (2010). https://doi.org/10.1145/1807167.1807184

Meyer, U., Sanders, P.: Δ-stepping: a parallelizable shortest path algorithm. Journal of Algorithms 49(1), 114–152 (2003). https://doi.org/10.1016/S0196-6774(03)00076-2

Murphy, R.C., Wheeler, K.B., Barrett, B.W., Ang, J.A.: Introducing the graph 500. Cray Users Group (CUG) 19, 45–74 (2010)

Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the web. Tech. rep., Stanford InfoLab (1999)

Pan, Y.: Multi-GPU Graph Processing. Ph.D. thesis, University of California, Davis (2019)

Soga, T., Musa, A., Okabe, K., et al.: Performance of SOR methods on modern vector and scalar processors. Computers & Fluids 45(1), 215–221 (2011). https://doi.org/10.1016/j.compfluid.2010.12.024

Tiskin, A.: The design and analysis of bulk-synchronous parallel algorithms. Ph.D. thesis, Citeseer (1998)

Wang, Y., Davidson, A., Pan, Y., et al.: Gunrock: A high-performance graph processing library on the GPU. In: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. pp. 1–12. ACM (2016). https://doi.org/10.1145/2851141.2851145

Yamada, Y., Momose, S.: Vector engine processor of NEC brand-new supercomputer SX-Aurora TSUBASA. In: Intenational symposium on High Performance Chips (Hot Chips2018) (2018)

Zhong, J., He, B.: Medusa: Simplified graph processing on GPUs. IEEE Transactions on Parallel and Distributed Systems 25(6), 1543–1552 (2013). https://doi.org/10.1109/TPDS.2013.111