Model-Driven One-Sided Factorizations on Multicore Accelerated Systems

Jack Dongarra; Azzam Haidar; Jakub Kurzak; Piotr Luszczek; Stanimire Tomov; Asim YarKhan

doi:10.14529/jsfi140105

Authors

Jack Dongarra University of Tennessee Knoxville, Knoxville; Oak Ridge National Laboratory, Oak Ridge; University of Manchester, Manchester M13 9PL
Azzam Haidar University of Tennessee Knoxville, Knoxville
Jakub Kurzak University of Tennessee Knoxville, Knoxville
Piotr Luszczek University of Tennessee Knoxville, Knoxville
Stanimire Tomov University of Tennessee Knoxville, Knoxville
Asim YarKhan University of Tennessee Knoxville, Knoxville

Abstract

Hardware heterogeneity of HPC platforms is no longer considered unusual; instead, it has become the most viable way forward towards Exascale. The heterogeneous resources available to modern computers are designed for different workloads, and their efficient use is closely aligned with the specialized role envisaged by their design. Commonly, in order to use GPU resources efficiently, the workload in question must have a much greater degree of parallelism than workloads often associated with multicore processors (CPUs). Available GPU variants differ in their internal architecture and, as a result, are capable of handling workloads of varying degrees of complexity and a range of computational patterns. This vast array of applicable workloads will likely lead to an ever-accelerating mixing of multicore CPUs and GPUs in multi-user environments, with the ultimate goal of offering adequate computing facilities for a wide range of scientific and technical workloads. In this paper, we present a research prototype that uses a lightweight runtime environment to manage the resource-specific workloads and to control the dataflow and parallel execution in hybrid systems. Our lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. This concept is reminiscent of dataflow and systolic architectures in its conceptualization of a workload as a set of side-effect-free tasks that pass data items whenever the associated work assignment has been completed. Additionally, our task abstractions and their parametrization enable uniformity in the algorithmic development across all the heterogeneous resources without sacrificing precious compute cycles. We include performance results for dense linear algebra functions that demonstrate the practicality and effectiveness of our approach and its ability to fully utilize a wide range of accelerator hardware.
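The task superscalar idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' runtime or its API: the names `ToyRuntime`, `insert_task`, and `barrier` are hypothetical, scalars stand in for matrix tiles, and CPU threads stand in for heterogeneous devices. The code is written as a serial loop nest for a Cholesky factorization (one of the one-sided factorizations); each task declares which data items it reads and writes, and the toy runtime derives the ordering constraints from those declarations.

```python
import math
from concurrent.futures import ThreadPoolExecutor, wait


class ToyRuntime:
    """Hypothetical toy task-superscalar runtime (illustration only)."""

    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.last_writer = {}  # data item -> future of the last task writing it
        self.readers = {}      # data item -> futures reading it since that write
        self.futures = []

    def insert_task(self, fn, reads=(), writes=()):
        """Submit fn; items in `writes` are treated as read-write."""
        deps = []
        for item in reads:                      # read-after-write
            if item in self.last_writer:
                deps.append(self.last_writer[item])
        for item in writes:
            if item in self.last_writer:        # write-after-write
                deps.append(self.last_writer[item])
            deps.extend(self.readers.get(item, []))  # write-after-read

        def run():
            wait(deps)  # dependencies are always earlier tasks in FIFO order
            fn()

        fut = self.pool.submit(run)
        for item in reads:
            self.readers.setdefault(item, []).append(fut)
        for item in writes:
            self.last_writer[item] = fut
            self.readers[item] = []
        self.futures.append(fut)

    def barrier(self):
        """Wait for all tasks and re-raise the first task exception, if any."""
        for fut in self.futures:
            fut.result()
        self.pool.shutdown()


def cholesky(a, workers=4):
    """Right-looking Cholesky on 1x1 'tiles'; overwrites the lower triangle."""
    n = len(a)
    rt = ToyRuntime(workers)

    def potrf(k):
        a[k][k] = math.sqrt(a[k][k])

    def trsm(i, k):
        a[i][k] /= a[k][k]

    def update(i, j, k):  # plays the role of syrk (i == j) or gemm (i != j)
        a[i][j] -= a[i][k] * a[j][k]

    # Serial-looking code: the runtime infers the parallelism.
    for k in range(n):
        rt.insert_task(lambda k=k: potrf(k), writes=[(k, k)])
        for i in range(k + 1, n):
            rt.insert_task(lambda i=i, k=k: trsm(i, k),
                           reads=[(k, k)], writes=[(i, k)])
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                rt.insert_task(lambda i=i, j=j, k=k: update(i, j, k),
                               reads=[(i, k), (j, k)], writes=[(i, j)])
    rt.barrier()
    return a
```

Calling `cholesky` on a small symmetric positive definite matrix stored as a list of lists leaves the Cholesky factor in its lower triangle. In this sketch a task simply blocks until the earlier tasks it depends on have finished; a production runtime would instead track ready tasks and map them to the resource best suited to each kernel.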
