Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing

Rabab Al-Omairy; Guillermo Miranda; Hatem Ltaief; Rosa M. Badia; Xavier Martorell; Jesus Labarta; David Keyes

doi:10.14529/jsfi150103

Authors

Rabab Al-Omairy KAUST, Thuwal
Guillermo Miranda Barcelona Supercomputing Center, Centro Nacional de Supercomputacion, Barcelona
Hatem Ltaief KAUST, Thuwal
Rosa M. Badia Barcelona Supercomputing Center, Centro Nacional de Supercomputacion, Barcelona Articial Intelligence Research Institute (IIIA) - Spanish National Research Council (CSIC)
Xavier Martorell Barcelona Supercomputing Center, Centro Nacional de Supercomputacion, Barcelona Universitat Politecnica de Catalunya, Barcelona
Jesus Labarta Barcelona Supercomputing Center, Centro Nacional de Supercomputacion, Barcelona Universitat Politecnica de Catalunya, Barcelona
David Keyes KAUST, Thuwal

DOI:

https://doi.org/10.14529/jsfi150103

Abstract

We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in the now ubiquitous non-uniform memory access (NUMA) high concurrency environment of multicore processors. The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks. Work stealing occurs within an innovative NUMA-aware scheduling policy to reduce data movement between NUMA nodes. The overall approach achieves separation of concerns by abstracting the complexity of the hardware from the end users so that high productivity can be achieved. Performance results on a large NUMA system outperform the state-of-the-art existing implementations up to a two fold speedup for the Cholesky factorization, as well as the symmetric matrix inversion, while the OmpSs-enabled code maintains strong similarity to its original sequential version.

References

A. Charara, H. Ltaief, D. Gratadour, D. Keyes, A. Sevin, A. Abdelfattah, E. Gendron, C. Morel and Fabrice Vidal. Pipelining Computational Stages of the Tomographic Reconstructor for Multi-object Adaptive Optics on a Multi-GPU System. SC '14: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 262-273, 2014.

E. Agullo, H. Bouwmeester, J. Dongarra, J. Kurzak, J. Langou, and L. Rosenberg. Towards an Efficient Tile Matrix Inversion of Symmetric Positive Denite Matrices on Multicore Architectures. High Performance Computing for Computational Science VECPAR 2010, 6449:129-138, 2011.

E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects. J. Phys.: Conf. Ser., 180(1), 2009.

E. Anderson, Z. Bai, C. Bischof, Suzan L. Blackford, James W. Demmel, Jack J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and Danny C. Sorensen. LAPACK User's Guide. Society for Industrial and Applied Mathematics, Philadelphia, 3rd edition, 1999.

E. Ayguade, R. M. Badia, D. Cabrera, A. Duran, M. Gonzalez, F. Igual, D. Jimenez, J. Labarta, X. Martorell, R. Mayo, J. M. Perez, and E. S. Quintana-Ort. A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures. In Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism, IWOMP '09, pages 154-167, Berlin, Heidelberg, 2009. Springer-Verlag.

L. Suzan Blackford, J. Choi, A. Cleary, E. F. D'Azevedo, J. W. Demmel, I. S. Dhillon, J. J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. W. Walker, and R. Clint Whaley. ScaLAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, 1997.

F. Broquedis, O. Aumage, B. Goglin, S. Thibault, P.Wacrenier, and R. Namyst. Structuring the Execution of OpenMP Applications for Multicore Architectures. In IEEE International Parallel and Distributed Processing Symposium, pages 1-10, 2010.

F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault, and R. Namyst. hwloc: a Generic Framework for Managing Hardware Anities in HPC Applications. In IEEE, editor, PDP 2010 - The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, Pisa, Italie, February 2010.

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures. Parallel Computing, 35(1):38-53, 2009.

E. Chan, E. S. Quintana-Orti, G. Quintana-Orti, and R. van de Geijn. Supermatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-core Architectures. In SPAA '07: Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 116-125, New York, NY, USA, 2007. ACM.

Q. Chen, M. Guo, and H. Guan. LAWS: Locality-Aware Work-Stealing for Multi-Socket Multi-core Architectures. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS '14, pages 3-12, New York, NY, USA, 2014. ACM.

J. Dongarra and P. Beckman et al. The International Exascale Software Roadmap. International Journal of High Performance Computer Applications, 25(1), 2011.

A. Drebes, K. Heydemann, N. Drach, A. Pop, and A. Cohen. Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages. ACM Trans. Archit. Code Optim., 11(3):30:1-30:25, August 2014.

A. Duran, R. Ferrer, E. Ayguade, R. M. Badia, and J. Labarta. A Proposal to Extend the OpenMP Tasking Model with Dependent Tasks. International Journal of Parallel Programming, 37(3):292-305, 2009.

K. Faxen. Wool -- A Work Stealing Library. SIGARCH Comput. Archit. News, 36(5):93-100, June 2009.

G. H. Golub and C. F. Van Loan. Matrix Computations. John Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, Maryland, third edition, 1996.

N. Higham. Accuracy and Stability of Numerical Algorithms, Second Edition. SIAM, 2002.

E. Jeannot. Symbolic Mapping and Allocation for the Cholesky Factorization on NUMA Machines: Results and Optimizations. IJHPCA, 27(3):283-290, 2013.

L. Karlsson and B. Kagstrom. Parallel Two-Stage Reduction to Hessenberg Form Using Dynamic Scheduling on Shared-Memory Architectures. Parallel Computing, 2011. DOI:10.1016/j.parco.2011.05.001.

D. E. Keyes. Exaop/s: The Why and The How. Comptes Rendus Mcanique, 339(23):70-77, 2011. Le Calcul Intensif.

J. Kurzak, A. Buttari, and J. J. Dongarra. Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization. IEEE Transactions on Parallel and Distributed Systems, 19(9):1-11, September 2008.

J. Kurzak, H. Ltaief, J. J. Dongarra, and R. M. Badia. Scheduling Dense Linear Algebra Operations on Multicore Processors. Concurrency Computat.: Pract. Exper., 21(1):15-44, 2009.

H. Ltaief, J. Kurzak, and J. Dongarra. Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures. IEEE Transactions on Parallel and Distributed Systems, 21(4), April 2010.

A. Muddukrishna, P. A. Jonsson, V. Vlassov, and M. Brorsson. Locality-Aware Task Scheduling and Data Distribution on NUMA Systems. In OpenMP in the Era of Low Power Devices and Accelerators, pages 156-170. Springer, 2013.

G. Quintana-Ort, E. S. Quintana-Ort, E. Chan, R. A. van de Geijn, and F. G. Van Zee. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. In PDP, pages 301-310. IEEE Computer Society, 2008.

R. Yokota, G. Turkiyyah and D. Keyes. Communication Complexity of the Fast Multipole Method and its Algebraic Variants. International Journal of Supercomputing frontiers and innovations, 1(1), 2014.

University of Tennessee Knoxville. PLASMA Users' Guide, Parallel Linear Algebra Software for Multicore Architectures, Version 2.4.5, November 2012.

Y. Yan, J. Zhao, Y. Guo, and V. Sarkar. Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement. In Proceedings of the 22nd International Conference on Languages and Compilers for Parallel Computing, LCPC'09, pages 172-187, Berlin, Heidelberg, 2010. Springer-Verlag.

A. YarKhan, J. Kurzak, and J. Dongarra. QUARK Users' Guide: QUeueing And Runtime for Kernels. University of Tennessee Innovative Computing Laboratory Technical Report ICL-UT-11-02, 2011.

F. G. Van Zee, E. Chan, R. A. van de Geijn, E. S. Quintana-Orti, and G. Quintana-Orti. The libflame Library for Dense Matrix Computations. Computing in Science and Engineering, 11(6):56-63, November/December 2009.