AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration
Highlights
- Awarded Best Paper at MLSys 2024.
- A novel and interesting post-training quantization method for getting LLMs to run on edge devices.
- Generally a solid paper. It reads a bit dense, though that may partly be due to my lack of familiarity with the area.
Summary
Large Language Models are historically massive and runnable only on large-scale computational platforms like the cloud. Running such models on the edge is often prohibitive due both to (1) the limited resources edge devices possess and (2) sheer model size. AWQ is a post-training quantization method for LLMs that minimizes quantization error and achieves higher accuracy at low-bit precision without requiring a large calibration set or backpropagation. This makes AWQ suitable for resource-constrained environments such as edge devices.
Key Contributions
- AWQ (Activation-aware Weight Quantization), "a hardware-friendly low-bit weight-only quantization method for LLMs".
- Activation-aware, not weight-aware. AWQ identifies a small fraction (0.1%-1.0%) of "salient" weight channels that matter far more for LLM accuracy than the rest. Interestingly, saliency is determined by activation magnitude rather than weight norm, and scaling those channels up before quantization preserves model accuracy (a rough sketch of the idea appears after this list).
- AWQ uses a per-channel scaling optimization that does not require backpropagation; it is claimed to be very data efficient.
- There is an interesting low-level part of the paper where they use 4-bit weight packing and SIMD-aware bitpacking, which grant substantial speedups. They also use kernel fusion to collapse sequences of operations into single kernels.
- TinyChat, an open-source LLM system that runs on constrained edge devices. I couldn't tell for sure, but this system might be a separate paper altogether.
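To make the activation-aware scaling concrete, here is a minimal sketch of how the salient-channel scaling and per-channel scale search might look for a single linear layer. This is my own illustration, not the authors' code: the function names, the naive group-wise quantizer, and the exact normalization of the scales are assumptions, and the real implementation differs in details.

```python
import numpy as np

def fake_quantize(w, n_bits=4, group_size=128):
    """Naive group-wise uniform quantization round trip (quantize then dequantize).
    Assumes the input dimension is divisible by group_size."""
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    w_min = wg.min(axis=-1, keepdims=True)
    w_max = wg.max(axis=-1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / (2 ** n_bits - 1)
    q = np.clip(np.round((wg - w_min) / scale), 0, 2 ** n_bits - 1)
    return (q * scale + w_min).reshape(out_f, in_f)

def awq_style_scale_search(w, x, n_bits=4, n_grid=20):
    """Grid-search a per-input-channel scale s = s_x ** alpha that minimizes the
    output error of the quantized layer. s_x is the mean activation magnitude,
    which is the 'activation-aware' part: salient channels get larger scales."""
    s_x = np.abs(x).mean(axis=0) + 1e-8      # per-channel activation magnitude
    y_ref = x @ w.T                          # full-precision reference output
    best_err, best_s = np.inf, np.ones_like(s_x)
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = s_x ** alpha
        s = s / np.sqrt(s.max() * s.min())   # keep scales roughly centered around 1
        w_q = fake_quantize(w * s, n_bits)   # quantize the scaled weight columns...
        y = (x / s) @ w_q.T                  # ...and fold 1/s into the activations
        err = float(((y - y_ref) ** 2).mean())
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

In use, one would call `awq_style_scale_search(W, X_calib)` on a small calibration batch, then store the quantized `W * s` and fold `1/s` into the preceding operator, so there is no extra runtime cost and no backpropagation anywhere in the search.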
Strengths
- The salient weight channels idea and approach is interesting and fairly novel. Making it activation-aware rather than weight-aware is also interesting.
- The method seems robust to calibration sets that are both smaller and drawn from different distributions. Interestingly, they also claim lower perplexity thanks to this data efficiency.
- The inclusion of the low-level and hardware optimization techniques is a nice touch (a toy illustration of the 4-bit packing idea follows this list).
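As a side note on the memory-side optimization, the core packing idea, separate from the paper's specific SIMD-friendly weight layout, is simply storing two 4-bit quantized values per byte. A toy numpy version with names of my own choosing:

```python
import numpy as np

def pack_int4(q):
    """Pack an even-length array of 4-bit values (0..15) into bytes, two per byte."""
    q = np.asarray(q, dtype=np.uint8)
    return (q[0::2] << 4) | q[1::2]

def unpack_int4(packed):
    """Recover the original 4-bit values from the packed bytes."""
    packed = np.asarray(packed, dtype=np.uint8)
    return np.stack([packed >> 4, packed & 0x0F], axis=-1).reshape(-1)

# Round trip: 8 quantized weights fit in 4 bytes.
q = np.array([3, 15, 0, 7, 9, 1, 12, 4])
assert np.array_equal(unpack_int4(pack_int4(q)), q)
```

The paper's kernels additionally reorder weights so that unpacking maps cleanly onto SIMD lanes, which this toy version does not capture; it only shows why 4-bit packing halves the weight memory traffic relative to 8-bit storage.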
Weaknesses / Questions
- The paper relies heavily on perplexity as the defining metric for its quantization, but low perplexity does not guarantee that tasks like math or code, which also require correctness (possibly at the cost of fidelity), perform as well. It does, however, validate on GSM8K and MBPP, which may be sufficient for those tasks. The paper also seems to acknowledge the need for correctness, not just fidelity, in at least one place.
- Kernel fusion and SIMD-aware bitpacking are major reasons for the practical speedup, but I couldn't find an ablation showing the effect of the other components on the system without the SIMD packing.
Related Work
- TinyChat, the system they've also designed.