Fairness in Serving Large Language Models
Highlights
- Introduces the Virtual Token Counter (VTC), a fair scheduler tailored for multi-client serving
- The fair scheduler works at token-level granularity within continuous batching
- Provides formal fairness guarantees
Summary
Fair scheduling for LLM serving is a relatively unexplored area. Addressing this gap, this paper introduces the Virtual Token Counter (VTC), a fair scheduling algorithm designed specifically for continuous batching in LLM inference. VTC operates at token-level granularity while providing formal fairness guarantees despite unknown request lengths and dynamic batching constraints. Through empirical and theoretical evaluation, the authors demonstrate that VTC outperforms common baselines in both fairness and resource utilization.
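To make the mechanism concrete, here is a minimal sketch of a VTC-style scheduler, based only on the behavior described above: each client carries a virtual counter of weighted tokens served, the scheduler always admits the backlogged client with the smallest counter, and a client returning from idle has its counter lifted so it cannot bank credit. Class and method names (`VTCScheduler`, `arrive`, `pick_next`, `account`) and the weight values are my own illustrative choices, not the paper's API.

```python
from collections import deque

class VTCScheduler:
    """Sketch of a Virtual Token Counter (VTC) style fair scheduler.

    Each client has a virtual counter tracking weighted service in
    tokens. Requests are admitted from the backlogged client with the
    smallest counter; counters are charged after each batch iteration,
    so no prediction of output length is needed.
    """

    def __init__(self, input_weight=1.0, output_weight=2.0):
        self.counters = {}   # client -> virtual token counter
        self.queues = {}     # client -> deque of pending requests
        self.input_weight = input_weight
        self.output_weight = output_weight

    def arrive(self, client, request):
        if not self.queues.get(client):
            # A client becoming active after being idle has its counter
            # lifted to the minimum over backlogged clients, so idling
            # cannot accumulate scheduling credit.
            backlogged = [self.counters[c] for c, q in self.queues.items() if q]
            floor = min(backlogged) if backlogged else 0.0
            self.counters[client] = max(self.counters.get(client, 0.0), floor)
        self.queues.setdefault(client, deque()).append(request)

    def pick_next(self):
        # Admit work from the backlogged client with the smallest counter.
        candidates = [c for c, q in self.queues.items() if q]
        if not candidates:
            return None
        client = min(candidates, key=lambda c: self.counters[c])
        return client, self.queues[client].popleft()

    def account(self, client, input_tokens, output_tokens):
        # Charge weighted tokens actually processed this iteration.
        self.counters[client] += (self.input_weight * input_tokens
                                  + self.output_weight * output_tokens)
```

Because accounting happens after tokens are processed, the counter never depends on the unknown output length up front, which is the core adaptation over classical deficit-based schemes.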
Key Contributions
- Domain-specific adaptation of deficit-based scheduling to multi-client LLM serving.
- Formal guarantees of fairness bounds
Strengths
- The integration of their scheduler into a continuous batching framework is pretty cool.
- The addition of formal theoretical fairness bounds is a nice touch.
Weaknesses / Questions
- The whole approach feels very similar to CFS. They compare against classic schedulers like CFS and DRR, but leave preemption to future work, framing it as an engineering choice.
- Several sloppy claims (see below):
  - "Today's LLM serving systems typically use first-come-first-serve (FCFS)": vLLM actually uses best-fit, not FCFS. Further, TGI (their other reference) selects requests based on similar sequence lengths, not FCFS. Claude, Gemini, etc. state in their blogs that they use priority-based admission.
  - "fair queuing in networking is typically applied to bit granularity, rather than packet granularity": Network schedulers typically operate on packets; WFQ is literally packet-based.
  - "This is the first work to discuss the fair serving of LLM to the best of our knowledge": Their own previous work (S-LoRA) discusses exactly that, and vLLM talks about multi-tenant fairness. Themis, Pollux, InferFair, and Planaria all deal with model-serving fairness, though to the authors' credit they are more general and not explicitly LLM-based.
  - "The characteristic of unknown output length before finishing a request prevents a direct adaptation of classical algorithms like SFQ and Deficit Round Robin (DRR) into the LLM serving.": Classical schedulers also address unknown job lengths. However, unlike VTC, CFS supports preemption to enable fine-grained fairness.
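For contrast with the paper's claim, here is a minimal sketch of classical Deficit Round Robin, the scheduler the authors say cannot be adapted directly. The sketch assumes per-job costs are known at dispatch time, which is exactly the assumption LLM serving breaks (output length is unknown until completion). Names (`DRRScheduler`, `arrive`, `round`) and the quantum value are illustrative, not from any specific implementation.

```python
from collections import deque

class DRRScheduler:
    """Minimal classical Deficit Round Robin sketch (packet-style).

    Each backlogged queue receives a fixed quantum of credit per round;
    a job is dispatched only when its full, known cost fits within the
    queue's accumulated deficit.
    """

    def __init__(self, quantum=100):
        self.quantum = quantum
        self.deficit = {}   # client -> accumulated deficit counter
        self.queues = {}    # client -> deque of job costs (known up front)

    def arrive(self, client, cost):
        self.queues.setdefault(client, deque()).append(cost)
        self.deficit.setdefault(client, 0)

    def round(self):
        served = []
        for client, q in self.queues.items():
            if not q:
                continue
            self.deficit[client] += self.quantum
            # Dispatch jobs while their known cost fits the deficit.
            while q and q[0] <= self.deficit[client]:
                cost = q.popleft()
                self.deficit[client] -= cost
                served.append((client, cost))
            if not q:
                self.deficit[client] = 0  # no credit carries over when idle
        return served
```

The `q[0] <= deficit` check is the crux: DRR needs the job's cost before serving it, whereas a VTC-style scheduler charges counters after tokens are generated, sidestepping the unknown-length problem without preemption.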
Related Work
- Continuous Batching