Skip to main content

Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

Venue
SIGCOMM 2024
Year
2024
Authors
Xuting Liu, Behnaz Arzani, Siva Kesava Reddy Kakarla, Liangyu Zhao, Vincent Liu, Miguel Castro, Srikanth Kandula, Luke Marshall
Topic
ML

🌟 Highlights

  • I like their framing of collective communication as a multi-commodity flow (MCF) problem as well as their general MILP solution formulation. The support of multi-cast/copy along with utilization of buffers for store and forwarding is also pretty cool.

  • Overall a pretty solid paper and enjoyable to read

📝 Summary

Collective communication is a fundemental operation in distributed machine learning that optimizes data exchange across multiple GPUs beyond simple unicast communication. However, existing solutions face significant scalability challenges. This paper represents a novel traffic-engineering inspired framework that models the collective communication based off of a multi-commodity flow problem augmented with both (a) copy, (b) store and forward buffering. They offer 3 solutions that vary in the optimality-scalability tradeoff. The objective is finishing the traffic as quickly as possible.

🧩 Key Contributions

  • Introduction of three solution algorithms for collective communication efficiency improvement:

    1. MILP (Mixed-integer linear programming) - The most general and accurate solution which supports multicast and copy.
    1. LP (Linear Programming) - Omits the copy part to improve scalability
    1. A*-based heuristic - Has the best scalability at the expense of optimality. Also supports copy.

Strengths

  • Solid description of each algorithm as well as formulation of the solution

  • Clear connection and description of the trade-offs for each algorithm.

⚠️ Weaknesses / Questions

  • Some of the charts were a bit difficult to follow. Font sizes on some of the charts were extremely difficult to read.

  • I'm personally very familiar with A*, having used it extensively in game development. The design section for their algorithm could have been way better. Some of the information from Appendix D could have been pulled into section 4.2 to beef it up

  • The paper shows that the MILP method can reach hours to solve for larger topologies. It's not clear how this translates to real-time or dynamic workloads. It appears this would be best suited for static single-tenant workloads. Perhaps they should have introduced that factor as well when an operator is deciding on which of the 3 algorithms is best suited for their specific environment.

🔍 Related Work

  • Traffic Engineering

  • MILP problems

📄 Attachments

PDF
📄 View PDF
Paper Link
🔗 External Page