Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
Highlights
-
I like their framing of collective communication as a multi-commodity flow (MCF) problem as well as their general MILP solution formulation. The support of multi-cast/copy along with utilization of buffers for store and forwarding is also pretty cool.
-
Overall a pretty solid paper and enjoyable to read
Summary
Collective communication is a fundemental operation in distributed machine learning that optimizes data exchange across multiple GPUs beyond simple unicast communication. However, existing solutions face significant scalability challenges. This paper represents a novel traffic-engineering inspired framework that models the collective communication based off of a multi-commodity flow problem augmented with both (a) copy, (b) store and forward buffering. They offer 3 solutions that vary in the optimality-scalability tradeoff. The objective is finishing the traffic as quickly as possible.
Key Contributions
-
Introduction of three solution algorithms for collective communication efficiency improvement:
-
- MILP (Mixed-integer linear programming) - The most general and accurate solution which supports multicast and copy.
-
- LP (Linear Programming) - Omits the copy part to improve scalability
-
- A*-based heuristic - Has the best scalability at the expense of optimality. Also supports copy.
Strengths
-
Solid description of each algorithm as well as formulation of the solution
-
Clear connection and description of the trade-offs for each algorithm.
Weaknesses / Questions
-
Some of the charts were a bit difficult to follow. Font sizes on some of the charts were extremely difficult to read.
-
I'm personally very familiar with A*, having used it extensively in game development. The design section for their algorithm could have been way better. Some of the information from Appendix D could have been pulled into section 4.2 to beef it up
-
The paper shows that the MILP method can reach hours to solve for larger topologies. It's not clear how this translates to real-time or dynamic workloads. It appears this would be best suited for static single-tenant workloads. Perhaps they should have introduced that factor as well when an operator is deciding on which of the 3 algorithms is best suited for their specific environment.
Related Work
-
Traffic Engineering
-
MILP problems