Tools for GPU Computing – Debugging and Performance Analysis of Heterogenous HPC Applications

Michael Knobloch; Bernd Mohr

doi:10.14529/jsfi200105

Authors

Michael Knobloch Forschungszentrum Jülich GmbH
Bernd Mohr Forschungszentrum Jülich GmbH

Abstract

General-purpose GPUs are now ubiquitous in high-end supercomputing. All but one of the announced (pre-)exascale systems (the exception being the Japanese Fugaku system, which is based on ARM processors) contain large numbers of GPUs that deliver the majority of the performance of these systems. Thus, GPU programming will be a necessity for application developers using high-end HPC systems. However, programming GPUs efficiently is an even more daunting task than traditional HPC application development. This becomes even more apparent for large-scale systems containing thousands of GPUs. Orchestrating all the resources of such a system poses a tremendous challenge for developers. Luckily, a rich ecosystem of tools exists to assist developers in every development step of a GPU application at all scales.

In this paper, we present an overview of these tools and discuss their capabilities. We start with an overview of different GPU programming models, ranging from low-level models such as CUDA, through pragma-based models such as OpenACC, to high-level approaches such as Kokkos. We discuss their respective tool interfaces, which are the main means by which tools obtain information on the execution of a kernel on the GPU. The main focus of this paper is on two classes of tools: debuggers and performance analysis tools. Debuggers help the developer identify problems on the CPU side, on the GPU side, and in the interplay between the two. Once the application runs correctly, performance analysis tools can be used to pinpoint bottlenecks in the execution of the code and help to increase the overall performance.
