Tools for GPU Computing – Debugging and Performance Analysis of Heterogenous HPC Applications
DOI:
https://doi.org/10.14529/jsfi200105
Abstract
General-purpose GPUs are now ubiquitous in high-end supercomputing. All but one of the announced (pre-)exascale systems (the exception being the Japanese Fugaku system, which is based on ARM processors) contain large numbers of GPUs, which deliver the majority of these systems' performance. Thus, GPU programming will be a necessity for application developers using high-end HPC systems. However, programming GPUs efficiently is an even more daunting task than traditional HPC application development. This becomes even more apparent for large-scale systems containing thousands of GPUs, where orchestrating all the resources poses a tremendous challenge for developers. Fortunately, a rich ecosystem of tools exists to assist developers in every development step of a GPU application at all scales.
In this paper we present an overview of these tools and discuss their capabilities. We start with an overview of different GPU programming models, ranging from low-level CUDA through pragma-based models such as OpenACC to high-level approaches such as Kokkos. We discuss their respective tool interfaces, which are the main means by which tools obtain information on the execution of a kernel on the GPU. The main focus of this paper is on two classes of tools: debuggers and performance analysis tools. Debuggers help the developer identify problems on the CPU side, on the GPU side, and in the interplay between the two. Once the application runs correctly, performance analysis tools can be used to pinpoint bottlenecks in the execution of the code and help increase overall performance.
License
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.