Fairness in Serving Large Language Models
Highlights
- Introduces the Virtual Token Counter (VTC), a fair scheduler tailored for multi-client serving
- The fair scheduler works at token-level granularity within continuous batching
- Provides formal fairness guarantees
Summary
Fair scheduling for LLM serving is a relatively unexplored area. Addressing this gap, this paper introduces the Virtual Token Counter (VTC), a fair scheduling algorithm designed specifically for continuous batching in LLM inference. VTC operates at token-level granularity while providing formal fairness guarantees despite unknown request lengths and dynamic batching constraints. Through empirical and theoretical evaluation, the authors demonstrate that VTC outperforms common baselines in both fairness and resource utilization.
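To make the mechanism concrete, here is a minimal sketch of a VTC-style scheduler, based only on the behavior described above: each client carries a virtual counter of weighted tokens served, the scheduler always admits the backlogged client with the smallest counter, and a client returning from idle has its counter lifted so it cannot bank credit. Class and method names (`VTCScheduler`, `arrive`, `pick_next`, `account`) and the weight values are my own illustrative choices, not the paper's API.

```python
from collections import deque

class VTCScheduler:
    """Sketch of a Virtual Token Counter (VTC) style fair scheduler.

    Each client has a virtual counter tracking weighted service in
    tokens. Requests are admitted from the backlogged client with the
    smallest counter; counters are charged after each batch iteration,
    so no prediction of output length is needed.
    """

    def __init__(self, input_weight=1.0, output_weight=2.0):
        self.counters = {}   # client -> virtual token counter
        self.queues = {}     # client -> deque of pending requests
        self.input_weight = input_weight
        self.output_weight = output_weight

    def arrive(self, client, request):
        if not self.queues.get(client):
            # A client becoming active after being idle has its counter
            # lifted to the minimum over backlogged clients, so idling
            # cannot accumulate scheduling credit.
            backlogged = [self.counters[c] for c, q in self.queues.items() if q]
            floor = min(backlogged) if backlogged else 0.0
            self.counters[client] = max(self.counters.get(client, 0.0), floor)
        self.queues.setdefault(client, deque()).append(request)

    def pick_next(self):
        # Admit work from the backlogged client with the smallest counter.
        candidates = [c for c, q in self.queues.items() if q]
        if not candidates:
            return None
        client = min(candidates, key=lambda c: self.counters[c])
        return client, self.queues[client].popleft()

    def account(self, client, input_tokens, output_tokens):
        # Charge weighted tokens actually processed this iteration.
        self.counters[client] += (self.input_weight * input_tokens
                                  + self.output_weight * output_tokens)
```

Because accounting happens after tokens are processed, the counter never depends on the unknown output length up front, which is the core adaptation over classical deficit-based schemes.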
Key Contributions
- Domain-specific adaptation of deficit-based scheduling to multi-client LLM serving.
- Formal guarantees of fairness bounds
Strengths
- The integration of their scheduler into a continuous batching framework is pretty cool.
- The addition of formal theoretical fairness bounds is a nice touch.
Weaknesses / Questions
- The whole approach feels very similar to CFS. They compare against classic schedulers like CFS and DRR, but leave preemption to future work, framing it as an engineering choice.
- Several sloppy claims (see below):
  - "Today's LLM serving systems typically use first-come-first-serve (FCFS)": vLLM actually uses best-fit, not FCFS. Further, TGI (their other reference) selects requests based on similar sequence lengths, not FCFS. Claude, Gemini, etc. state in their blogs that they use priority-based admission.
  - "fair queuing in networking is typically applied to bit granularity, rather than packet granularity": Network schedulers typically operate on packets; WFQ is literally packet-based.
  - "This is the first work to discuss the fair serving of LLM to the best of our knowledge": Their own previous work (S-LoRA) discusses exactly that, and vLLM talks about multi-tenant fairness. Themis, Pollux, InferFair, and Planaria all deal with model-serving fairness, though to the authors' credit they are more general and not explicitly LLM-based.
  - "The characteristic of unknown output length before finishing a request prevents a direct adaptation of classical algorithms like SFQ and Deficit Round Robin (DRR) into the LLM serving.": Classical schedulers also address unknown job lengths. However, unlike VTC, CFS supports preemption to enable fine-grained fairness.
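For contrast with the paper's claim, here is a minimal sketch of classical Deficit Round Robin, the scheduler the authors say cannot be adapted directly. The sketch assumes per-job costs are known at dispatch time, which is exactly the assumption LLM serving breaks (output length is unknown until completion). Names (`DRRScheduler`, `arrive`, `round`) and the quantum value are illustrative, not from any specific implementation.

```python
from collections import deque

class DRRScheduler:
    """Minimal classical Deficit Round Robin sketch (packet-style).

    Each backlogged queue receives a fixed quantum of credit per round;
    a job is dispatched only when its full, known cost fits within the
    queue's accumulated deficit.
    """

    def __init__(self, quantum=100):
        self.quantum = quantum
        self.deficit = {}   # client -> accumulated deficit counter
        self.queues = {}    # client -> deque of job costs (known up front)

    def arrive(self, client, cost):
        self.queues.setdefault(client, deque()).append(cost)
        self.deficit.setdefault(client, 0)

    def round(self):
        served = []
        for client, q in self.queues.items():
            if not q:
                continue
            self.deficit[client] += self.quantum
            # Dispatch jobs while their known cost fits the deficit.
            while q and q[0] <= self.deficit[client]:
                cost = q.popleft()
                self.deficit[client] -= cost
                served.append((client, cost))
            if not q:
                self.deficit[client] = 0  # no credit carries over when idle
        return served
```

The `q[0] <= deficit` check is the crux: DRR needs the job's cost before serving it, whereas a VTC-style scheduler charges counters after tokens are generated, sidestepping the unknown-length problem without preemption.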
Related Work
- Continuous Batching