Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations

Torsten Hoefler; Dmitry Moor

doi:10.14529/jsfi140204

Authors

Torsten Hoefler ETH Zurich, Zurich
Dmitry Moor ETH Zurich, Zurich

DOI:

https://doi.org/10.14529/jsfi140204

Abstract

Collective operations are among the most important communication operations in shared- and distributed-memory parallel applications. In this paper, we analyze the tradeoffs between energy, memory, and runtime of different algorithms to implement such operations. We show that each existing algorithms have varying behavior and no algorithm exists that is optimal in all three regards. We also show examples where of three different algorithms solving the same problem, each algorithm is best in a different metric. We conclude by posing the challenge to explore the resulting tradeoffs in a more structured manner.

References

A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model—One Step Closer Towards a Realistic Model for Parallel Computation. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’95, pages 95–105, New York, NY, USA, 1995. ACM.

Q. Ali, S. P. Midkiff, and V. S. Pai. Efficient High Performance Collective Communication for the Cell Blade. In Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, pages 193–203, New York, NY, USA, 2009. ACM.

G. Alm´asi, P. Heidelberger, C. J. Archer, X. Martorell, C. C. Erway, J. E. Moreira, B. Steinmacher-Burow, and Y. Zheng. Optimization of MPI Collective Communication on BlueGene/L Systems. In Proceedings of the 19th Annual International Conference on Supercomputing, ICS ’05, pages 253–262, New York, NY, USA, 2005. ACM.

M. Banikazemi and D. K. Panda. Efficient scatter communication in wormhole k-ary n-cubes with multidestination message passing, 1996.

M. Barnett, S. Gupta, D. G. Payne, L. Shuler, R. Geijn, and J.Watts. Interprocessor Collective Communication Library (InterCom). In Proceedings of the Scalable High Performance Computing Conference, pages 357–364. IEEE Computer Society Press, 1994.

M. Barnett, D. G. Payne, and R. V. D. Geijn. Optimal broadcasting in mesh-connected architectures. Technical report, 1991.

Barrett, B.W. and Brightwell, R. and Hemmert, S. and Pedretti, K. and Wheeler K. and Underwood, K.D. and Reisen, R. and Maccabe, A.B., and Hudson, T. The Portals 4.0 network programming interface. Sandia National Laboratories, November 2012. Technical Report SAND2012-10087.

E. D. Brooks, III. The butterfly barrier. Int. J. Parallel Program., 15(4):295–307, Oct. 1986.

J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D.Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8:1143 – 1156, 1997.

E. Chan, R. van de Geijn, W. Gropp, and R. Thakur. Collective communication on architectures that support simultaneous communication over multiple links. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’06, pages 2–11, New York, NY, USA, 2006. ACM.

H. A. Chen, Y. O. Carrasco, and A. W. Apon. MPI Collective Operations over IP Multicast. In Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing, IPDPS ’00, pages 51–60, London, UK, UK, 2000. Springer-Verlag.

G. Fagg, G. Bosilca, J. Pjeivac-Grbovic, T. Angskun, and J. Dongarra. Tuned: An Open MPI Collective Communications Component. In Distributed and Parallel Systems, pages 65–72. Springer US, 2007.

A. Faraj and X. Yuan. Automatic generation and tuning of mpi collective communication routines. In Proceedings of the 19th Annual International Conference on Supercomputing, ICS ’05, pages 393–402, New York, NY, USA, 2005. ACM.

R. Graham, S. Poole, P. Shamis, G. Bloch, N. Bloch, H. Chapman, M. Kagan, A. Shahar, I. Rabinovitz, and G. Shainer. ConnectX-2 InfiniBand management queues: First investigation of the new support for network offloaded collective operations. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pages 53–62, 2010.

D. Hensgen, R. Finkel, and U. Manber. Two algorithms for barrier synchronization. Int. J. Parallel Program., 17(1):1–17, Feb. 1988.

T. Hoefler, A. Lichei, and W. Rehm. Low-Overhead LogGP Parameter Assessment for Modern Interconnection Networks. In Proceedings of the 21st IEEE International Parallel & Distributed Processing Symposium, PMEO’07 Workshop. IEEE Computer Society, Mar. 2007.

T. Hoefler, A. Lumsdaine, and W. Rehm. Implementation and Performance Analysis of Non- Blocking Collective Operations for MPI. In Proceedings of the 2007 International Conference on High Performance Computing, Networking, Storage and Analysis, SC07. IEEE Computer Society/ACM, Nov. 2007.

T. Hoefler and T. Schneider. Optimization Principles for Collective Neighborhood Communications. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 98:1–98:10. IEEE Computer Society Press, Nov. 2012.

T. Hoefler, T. Schneider, and A. Lumsdaine. Characterizing the Influence of System Noise on Large-Scale Applications by Simulation. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10), Nov. 2010.

T. Hoefler, C. Siebert, and W. Rehm. A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast. In Proceedings of the 21st IEEE International Parallel & Distributed Processing Symposium (CAC’07 Workshop), page 232, Mar. 2007.

L. P. Huse. Collective communication on dedicated clusters of workstations. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 469–476. Springer- Verlag, 1999.

G. Iannello. Efficient Algorithms for the Reduce-Scatter Operation in LogGP. IEEE Trans. Parallel Distrib. Syst., 8(9):970–982, Sept. 1997.

K. Kandalla, H. Subramoni, J. Vienne, S. Raikar, K. Tomko, S. Sur, and D. Panda. Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL. In High Performance Interconnects, 2011 IEEE 19th Annual Symposium on, pages 27–34, Aug 2011.

R. M. Karp, A. Sahay, E. E. Santos, and K. E. Schauser. Optimal Broadcast and Summation in the LogP Model. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’93, pages 142–153, New York, NY, USA, 1993. ACM.

T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang. MagPIe: MPI’s Collective Communication Operations for Clustered Wide Area Systems. In Proceedings of the Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’99, pages 131–140, New York, NY, USA, 1999. ACM.

E. J. Kim, G. Link, K. H. Yum, N. Vijaykrishnan, M. Kandemir, M. Irwin, and C. Das. A holistic approach to designing energy-efficient cluster interconnects. Computers, IEEE Transactions on, 54(6):660–671, Jun 2005.

S. Kini, J. Liu, J. Wu, P. Wyckoff, and D. Panda. Fast and scalable barrier using rdma and multicast mechanisms for infiniband-based clusters. In J. Dongarra, D. Laforenza, and S. Orlando, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 2840 of Lecture Notes in Computer Science, pages 369–378. Springer Berlin Heidelberg, 2003.

V. A. Korthikanti and G. Agha. Towards optimizing energy costs of algorithms for shared memory architectures. In F. M. auf der Heide and C. A. Phillips, editors, SPAA, pages 157–165. ACM, 2010.

V. A. Korthikanti and G. Agha. Energy-performance trade-off analysis of parallel algorithms for shared memory architectures. In Sustainable Computing: Informatics and Systems, In Press, 2011.

MPI Forum. MPI: A Message-Passing Interface standard. Version 3.0, 2012.

P. Patarasuk and X. Yuan. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distrib. Comput., 69(2):117–124, Feb. 2009.

B. L. Payne, M. Barnett, R. Littlefield, D. G. Payne, and R. V. D. Geijn. Global combine on mesh architectures with wormhole routing. In Proc. of 7 th Int. Parallel Proc. Symp, 1993.

J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. J. Dongarra. Performance analysis of mpi collective operations. In 19th IEEE International Parallel and Distributed Processing Symposium, 2005. Proceedings., 2005.

R. Rabenseifner. Optimization of collective reduction operations. In Proceedings of the International Conference on Computational Science, ICCS 2004, pages 1–9, 2004.

S. Ramos and T. Hoefler. Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi. In Proceedings of the 22nd international symposium on Highperformance parallel and distributed computing, pages 97–108. ACM, Jun. 2013.

P. Sack. Scalable collective message-passing algorithms. In PhD Thesis. University of Illinois at Urbana-Champain, 2011.

P. Sack and W. Gropp. Faster topology-aware collective algorithms through non-minimal communication. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pages 45–54, New York, NY, USA, 2012. ACM.

P. Sanders and J. F. Sibeyn. A bandwidth latency tradeoff for broadcast and reduction. Inf. Process. Lett., 86(1):33–38, Apr. 2003.

P. Sanders, J. Speck, and J. Traff. Full bandwidth broadcast, reduction and scan with only two trees. In Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2007.

P. Sanders, J. Speck, and J. L. Tr¨aff. Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Comput., 35(12):581–594, Dec. 2009.

M. Shroff and R. A. V. D. Geijn. CollMark: MPI Collective Communication Benchmark. Technical report, 2000.

S. Sur, U. K. R. Bondhugula, A. Mamidala, H. W. Jin, and D. K. Panda. High performance rdma based all-to-all broadcast for infiniband clusters. In Proceedings of the 12th International Conference on High Performance Computing, HiPC’05, pages 148–157, Berlin, Heidelberg, 2005. Springer-Verlag.

R. Thakur and W. Gropp. Improving the performance of collective operations in MPICH. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. Number 2840 in LNCS, Springer Verlag (2003) 257267 10th European PVM/MPI Users Group Meeting, pages 257–267. Springer Verlag, 2003.

R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of Collective communication operations in MPICH. International Journal of High Performance Computing Applications, 19:49–66, 2005.

J. L. Tr¨aff and A. Ripke. Optimal broadcast for fully connected networks. In Proceedings of the First International Conference on High Performance Computing and Communications, HPCC’05, pages 45–56, Berlin, Heidelberg, 2005. Springer-Verlag.

S. S. Vadhiyar, G. E. Fagg, and J. Dongarra. Automatically tuned collective communications. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, SC ’00, Washington, DC, USA, 2000. IEEE Computer Society.

A. Wagner, D. Buntinas, D. Panda, and R. Brightwell. Application-bypass reduction for large-scale clusters. In Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on, pages 404–411, Dec 2003.

J. Watts and R. Van De Geijn. A pipelined broadcast for multidimensional meshes. Parallel Processing Letters, 5(02):281–292, 1995.

W. Yu, D. K. Panda, and D. Buntinas. Scalable, High-performance NIC-based All-to-all Broadcast over Myrinet/GM. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, pages 125–134, Washington, DC, USA, 2004. IEEE Computer Society.