Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

Venue: SIGCOMM 2024
Year: 2024
Authors: Xuting Liu, Behnaz Arzani, Siva Kesava Reddy Kakarla, Liangyu Zhao, Vincent Liu, Miguel Castro, Srikanth Kandula, Luke Marshall
Topic: ML

🌟 Highlights

I like their framing of collective communication as a multi-commodity flow (MCF) problem as well as their general MILP solution formulation. The support of multi-cast/copy along with utilization of buffers for store and forwarding is also pretty cool.
Overall a pretty solid paper and enjoyable to read

📝 Summary

Collective communication is a fundemental operation in distributed machine learning that optimizes data exchange across multiple GPUs beyond simple unicast communication. However, existing solutions face significant scalability challenges. This paper represents a novel traffic-engineering inspired framework that models the collective communication based off of a multi-commodity flow problem augmented with both (a) copy, (b) store and forward buffering. They offer 3 solutions that vary in the optimality-scalability tradeoff. The objective is finishing the traffic as quickly as possible.

🧩 Key Contributions

Introduction of three solution algorithms for collective communication efficiency improvement:
1. MILP (Mixed-integer linear programming) - The most general and accurate solution which supports multicast and copy.
1. LP (Linear Programming) - Omits the copy part to improve scalability
1. A*-based heuristic - Has the best scalability at the expense of optimality. Also supports copy.

✅ Strengths

Solid description of each algorithm as well as formulation of the solution
Clear connection and description of the trade-offs for each algorithm.

⚠️ Weaknesses / Questions

Some of the charts were a bit difficult to follow. Font sizes on some of the charts were extremely difficult to read.
I'm personally very familiar with A*, having used it extensively in game development. The design section for their algorithm could have been way better. Some of the information from Appendix D could have been pulled into section 4.2 to beef it up
The paper shows that the MILP method can reach hours to solve for larger topologies. It's not clear how this translates to real-time or dynamic workloads. It appears this would be best suited for static single-tenant workloads. Perhaps they should have introduced that factor as well when an operator is deciding on which of the 3 algorithms is best suited for their specific environment.

🔍 Related Work

Traffic Engineering
MILP problems

📄 Attachments

PDF: 📄 View PDF
Paper Link: 🔗 External Page