Punica: Multi-Tenant LoRA Serving
Highlights
- The CUDA kernel algorithm is novel (I built a Python POC below, since the paper doesn't explain it well)
- The paper was interesting but could have used a lot more polish. I was slightly disappointed reading it, especially since it has genuinely cool elements.
Summary
Punica addresses the problem of serving an arbitrary mixture of LoRA models from a shared backend to multiple clients. Its core contribution is the SGMV (Segmented Gather Matrix-Vector multiplication) kernel, which makes that batching efficient.
Key Contributions
A system that (almost verbatim from paper):
- Identifies the opportunity for batch processing requests of multiple, different LoRA models
- Designs and implements an efficient CUDA kernel for running multiple LoRA models concurrently.
- New scheduling mechanisms to consolidate multi-tenant LoRA workloads.
Strengths
- Contributions are extremely clear and flow cleanly into the rest of the paper. The paper itself is very approachable.
- Batching an arbitrary mixture of different LoRAs against the same backend is interesting and novel: the input batch is segmented by LoRA adapter on a per-request basis. That's the most interesting contribution in this paper; a small sketch of the idea follows this list.
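A minimal sketch of that segmentation idea, assuming three adapters and a made-up assignment of requests to adapters (my own reconstruction in NumPy, not the paper's CUDA kernel):
import numpy as np
# Segment a batch by adapter id so each adapter's requests are handled
# with one pair of small dense matmuls (shrink A, then expand B).
np.random.seed(0)
n, d, h, r = 6, 16, 8, 4                    # batch, in-dim, out-dim, LoRA rank
adapter_ids = np.array([0, 0, 1, 2, 2, 2])  # which LoRA each request uses (made up)
num_adapters = 3
X = np.random.randn(n, d)
A = np.random.randn(num_adapters, d, r)     # per-adapter shrink weights
B = np.random.randn(num_adapters, r, h)     # per-adapter expand weights
Y_lora = np.zeros((n, h))
for a in range(num_adapters):
    seg = adapter_ids == a                  # gather this adapter's segment
    Y_lora[seg] = (X[seg] @ A[a]) @ B[a]    # two dense matmuls per segment
# Reference: plain per-request loop, to check the segmented result
Y_ref = np.stack([(X[i] @ A[adapter_ids[i]]) @ B[adapter_ids[i]] for i in range(n)])
assert np.allclose(Y_lora, Y_ref)
Each adapter's segment collapses into two dense matmuls; as I read the paper, the SGMV kernel does this gather-and-multiply for all segments in a single CUDA launch.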
Weaknesses / Questions
- Most annoyingly, SGMV is really cool, but the paper never lands a clear, solid description and explanation of it in the design section. It took me far too long to understand, which is frustrating enough that I worked it out and built a Python implementation of it below.
- The roofline plot in Figure 7 is sub-standard: there's a lot of wasted space, it's impossible to see where the lines converge around FLOP = 6, and the legend is vague. Given that the entirety of Section 7.1 flows from this chart, and the changes needed to polish it are minor, it's a bit of a miss.
- The writing is wildly inconsistent. Many parts feel like a one-shot pass and could be further polished or rewritten; Section 2.1 is an example.
- Figure 8 shows SGMV underperforming BMM, which would be fine if the authors justified why one of the baselines outperforms their system, but the figure is simply presented without explaining that anomaly. Elsewhere they at least justify the lack of performance, or I can trace it back to why it behaves the way it does.
Related Work
- Mention related papers, comparisons
Algorithm Example
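The code below is my own reconstruction of the idea, not the paper's kernel. It compares a naive per-request loop (backbone matmul plus LoRA for every request separately) against computing the backbone for the whole batch in one matmul and looping only over the small per-request LoRA terms.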
import numpy as np
import time
# Configurable parameters
n = 64 # batch size
d = 64 # input dimension
h = 32 # output dimension
r = 8 # LoRA rank
# Create random data
np.random.seed(42)
X = np.random.randn(n, d) # (n x d)
W = np.random.randn(d, h) # (d x h)
A_list = [np.random.randn(d, r) for _ in range(n)]
B_list = [np.random.randn(r, h) for _ in range(n)]
# Show Naive per-request loop for LoRA computation
start = time.time()
Y_naive = np.zeros((n, h))
for i in range(n):
    x_i = X[i]            # (d,)
    A_i = A_list[i]       # (d x r)
    B_i = B_list[i]       # (r x h)
    y_backbone = x_i @ W  # (h,)
    y_lora = (x_i @ A_i) @ B_i
    Y_naive[i] = y_backbone + y_lora
naive_time = time.time() - start
# Show batched backbone + SGMV-style LoRA computation
start = time.time()
Y_backbone = X @ W # (n x h)
Y_lora = np.zeros((n, h))
for i in range(n):
    Y_lora[i] = (X[i] @ A_list[i]) @ B_list[i]
Y_sgmv = Y_backbone + Y_lora
sgmv_time = time.time() - start
# Compare correctness (sanity check)
assert np.allclose(Y_naive, Y_sgmv, atol=1e-8), "Results do not match!"
# Show results and benchmark times
print(f"Naive per-request loop: {naive_time*1000:.2f} ms")
print(f"SGMV-style (batched backbone + per-request LoRA): {sgmv_time*1000:.2f} ms")
print(f"SGMV is {naive_time/sgmv_time:.2f}x faster (in Python; real CUDA is even more pronounced)")
print("\nSample output (first row):\n", np.round(Y_sgmv[0], 3))
(sandbox) user@machine:~/sandbox$ python test.py
Naive per-request loop: 0.29 ms
SGMV-style (batched backbone + per-request LoRA): 0.18 ms
SGMV is 1.63x faster (in Python; real CUDA is even more pronounced)
Sample output (first row):
[ 5.833 55.263 -20.852 35.888 11.676 ..... ]
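Note that this POC gives every request its own adapter, the worst case for batching; when requests share adapters, the LoRA loop collapses further into one matmul per adapter group (see the sketch under Strengths). The remaining Python loop is also exactly what the real kernel replaces: as I read the paper, the gather of per-request adapter weights and the small matmuls happen inside a single batched CUDA kernel rather than n separate launches.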