Distributed Graph Algorithms for Multiple Vector Engines of NEC SX-Aurora TSUBASA Systems
DOI:
https://doi.org/10.14529/jsfi210206Abstract
This paper describes the world-first attempt to develop distributed graph algorithm implementations, aimed for modern NEC SX-Aurora TSUBASA vector systems. Such systems are equipped with up to eight powerful vector engines, which are capable to significantly accelerate graph processsing and simultaneously increase the scale of processed input graphs. This paper describes distributed implementations of three widely-used graph algorithms: Page Rank (PR), Bellman-Ford Single Source Shortest Paths (further referred as SSSP) and Hyperlink-Induced Topic Search (HITS), evaluating their performance and scalability on Aurora 8 system. In this paper we describe graph partitioning strategies, communication strategies, programming models and single-VE optimizations used in these implementations. The developed implementations achieve 40, 6.6 and 1.3 GTEPS performance on PR, SSSP and HITS algorithm on 8 vector engines, at the same time achieving up to 1.5x, 2x and 2.5x acceleration on 2, 4 and 8 vector engines of Aurora 8 systems. Finally, this paper describes an approach to incorporate distributed graph processing support into our previously developed Vector Graph Library (VGL) framework – a novel framework for graph analytics on NEC SX-Aurora TSUBASA architecture.
References
NEC SX-Aurora TSUBASA A300-8. https://www.nec.com/en/global/solutions/hpc/sx/A300-8.html, accessed: 2021-04-06
NVGRAPH. https://developer.nvidia.com/nvgraph, accessed: 2021-06-28
Afanasyev, I.V.: Developing an architecture-independent graph framework for modern vector processors and NVIDIA GPUs. Supercomputing Frontiers and Innovations 7(4), 49–61 (2021). https://doi.org/10.14529/jsfi200404
Afanasyev, I.V., Voevodin, V.V., Komatsu, K., Kobayashi, H.: VGL: a high-performance graph processing framework for the NEC SX-Aurora TSUBASA vector architecture. The Journal of Supercomputing 77(8), 8694–8715 (2021). https://doi.org/10.1007/s11227-020-03564-9
Afanasyev, I.V., Voevodin, V.V., Komatsu, K., Kobayashi, H., et al.: Developing efficient implementations of Bellman–Ford and Forward-Backward graph algorithms for NEC SXACE. Supercomputing Frontiers and Innovations 5(3), 65–69 (2018). https://doi.org/10.14529/jsfi180311
Afanasyev, I.V., Voevodin, V.V., Komatsu, K., Kobayashi, H., et al.: Analysis of relationship between SIMD-processing features used in NVIDIA GPUs and NEC SX-Aurora TSUBASA vector processors. In: International Conference on Parallel Computing Technologies. Lecture Notes in Computer Science, vol. 11657, pp. 125–139. Springer (2019). https://doi.org/10.1007/978-3-030-25636-4_10
Afanasyev, I.V., Voevodin, V.V., Komatsu, K., Kobayashi, H., et al.: Developing efficient implementations of shortest paths and page rank algorithms for NEC SX-Aurora TSUBASA architecture. Lobachevskii Journal of Mathematics 40(11), 1753–1762 (2019). https://doi.org/10.1134/S1995080219110039
Beamer, S., Asanović, K., Patterson, D.: Direction-optimizing breadth-first search. Scientific Programming 21(3-4), 137–148 (2013). https://doi.org/10.1109/SC.2012.50
Besta, M., Podstawski, M., Groner, L., et al.: To push or to pull: On reducing communication and synchronization in graph computations. In: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing. pp. 93–104. ACM (2017). https://doi.org/10.1145/3078597.3078616
Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: A recursive model for graph mining. In: Proceedings of the 2004 SIAM International Conference on Data Mining. pp. 442–446. SIAM (2004). https://doi.org/10.1137/1.9781611972740.43
Egawa, R., Komatsu, K., Isobe, Y., et al.: Performance and power analysis of SX-ACE using HP-X benchmark programs. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER). pp. 693–700. IEEE Computer Society (2017). https://doi.org/10.1109/CLUSTER.2017.65
Egawa, R., Komatsu, K., Kobayashi, H.: Designing an HPC refactoring catalog toward the exa-scale computing era. In: Resch, M.M., Bez, W., Focht, E., Kobayashi, H., Patel, N. (eds.) Sustained Simulation Performance 2014. pp. 91–98. Springer (2015). https://doi.org/10.1007/978-3-319-10626-7_8
Egawa, R., Komatsu, K., Kobayashi, H., et al.: Potential of a modern vector supercomputer for practical applications: performance evaluation of SX-ACE. The Journal of Supercomputing 73(9), 3948–3976 (2017). https://doi.org/10.1007/s11227-017-1993-y
Egawa, R., Komatsu, K., Takizawa, H.: Designing an open database of system-aware code optimizations. In: 2017 Fifth International Symposium on Computing and Networking (CANDAR). pp. 369–374. IEEE Computer Society (2017). https://doi.org/10.1109/CANDAR.2017.102
Gharaibeh, A., Reza, T., Santos-Neto, E., et al.: Efficient large-scale graph processing on hybrid CPU and GPU systems. arXiv preprint arXiv:1312.3018 (2013)
Goldberg, A., Radzik, T.: A heuristic improvement of the Bellman-Ford algorithm. Tech. rep., Stanford Univ CA Dept of Computer Science (1993)
Gómez, C., Casas, M., Mantovani, F., Focht, E.: Optimizing sparse matrix-vector multiplication in NEC SX-Aurora Vector Engine. Tech. rep., Technical Report, Barcelona Supercomputing Center (2020)
Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I.: Graphx: Graph processing in a distributed dataflow framework. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). pp. 599–613 (2014)
Gustafson, J.L.: Reevaluating Amdahl’s law. Communications of the ACM 31(5), 532–533 (1988). https://doi.org/10.1145/42411.42415
Han, M., Daudjee, K.: Giraph unchained: Barrierless asynchronous parallel execution in pregel-like graph processing systems. Proceedings of the VLDB Endowment 8(9), 950–961 (2015). https://doi.org/10.14778/2777598.2777604
Karypis, G., Kumar, V.: Metis: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices (1997), https://hdl.handle.net/11299/215346
Kleinberg, J.M., Kumar, R., Raghavan, P., et al.: The web as a graph: Measurements, models, and methods. In: International Computing and Combinatorics Conference. Lecture Notes in Computer Science, vol. 1627, pp. 1–17. Springer (1999). https://doi.org/10.1007/3-540-48686-0_1
Komatsu, K., Egawa, R., Hirasawa, S., Takizawa, H., Itakura, K., Kobayashi, H.: Migration of an atmospheric simulation code to an OpenACC platform using the Xevolver framework. In: 2015 Third International Symposium on Computing and Networking (CANDAR). pp. 515–520. IEEE Computer Society (2015). https://doi.org/10.1109/CANDAR.2015.102
Komatsu, K., Egawa, R., Hirasawa, S., Takizawa, H., Itakura, K., Kobayashi, H.: Translation of large-scale simulation codes for an OpenACC platform using the Xevolver framework. International Journal of Networking and Computing 6(2), 167–180 (2016). https://doi.org/10.15803/ijnc.6.2_167
Komatsu, K., Egawa, R., Isobe, Y., et al.: An approach to the highest efficiency of the HPCG benchmark on the SX-ACE supercomputer. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC15), Poster. pp. 1–2 (2015)
Komatsu, K., Watanabe, O., Musa, A., et al.: Performance evaluation of a vector supercomputer SX-Aurora TSUBASA. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, USA, Nov. 11-16, 2018. pp. 54:1–54:12. SC ’18, IEEE (2018). https://doi.org/10.1109/SC.2018.00057
Liu, H., Huang, H.H.: Enterprise: breadth-first graph traversal on GPUs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1–12. ACM (2015). https://doi.org/10.1145/2807591.2807594
Malewicz, G., Austern, M.H., Bik, A.J., et al.: Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. pp. 135–146. ACM (2010). https://doi.org/10.1145/1807167.1807184
Meyer, U., Sanders, P.: Δ-stepping: a parallelizable shortest path algorithm. Journal of Algorithms 49(1), 114–152 (2003). https://doi.org/10.1016/S0196-6774(03)00076-2
Murphy, R.C., Wheeler, K.B., Barrett, B.W., Ang, J.A.: Introducing the graph 500. Cray Users Group (CUG) 19, 45–74 (2010)
Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the web. Tech. rep., Stanford InfoLab (1999)
Pan, Y.: Multi-GPU Graph Processing. Ph.D. thesis, University of California, Davis (2019)
Soga, T., Musa, A., Okabe, K., et al.: Performance of SOR methods on modern vector and scalar processors. Computers & Fluids 45(1), 215–221 (2011). https://doi.org/10.1016/j.compfluid.2010.12.024
Tiskin, A.: The design and analysis of bulk-synchronous parallel algorithms. Ph.D. thesis, Citeseer (1998)
Wang, Y., Davidson, A., Pan, Y., et al.: Gunrock: A high-performance graph processing library on the GPU. In: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. pp. 1–12. ACM (2016). https://doi.org/10.1145/2851141.2851145
Yamada, Y., Momose, S.: Vector engine processor of NEC brand-new supercomputer SX-Aurora TSUBASA. In: Intenational symposium on High Performance Chips (Hot Chips2018) (2018)
Zhong, J., He, B.: Medusa: Simplified graph processing on GPUs. IEEE Transactions on Parallel and Distributed Systems 25(6), 1543–1552 (2013). https://doi.org/10.1109/TPDS.2013.111
Downloads
Published
How to Cite
License
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-Non Commercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.