SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. G Xiao, J Lin, M Seznec, H Wu, J Demouth, S Han. International Conference on Machine Learning (ICML), pp. 38087-38099, 2023. Cited by 386.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. J Lin, J Tang, H Tang, S Yang, WM Chen, WC Wang, G Xiao, X Dang, ... Proceedings of Machine Learning and Systems (MLSys), 2024 (first released 2023). Cited by 237*.
Efficient Streaming Language Models with Attention Sinks. G Xiao, Y Tian, B Chen, S Han, M Lewis. International Conference on Learning Representations (ICLR), 2024. Cited by 156.
FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention. G Xiao, T Yin, WT Freeman, F Durand, S Han. arXiv preprint arXiv:2305.10431, 2023. Cited by 88.
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. Y Lin, H Tang, S Yang, Z Zhang, G Xiao, C Gan, S Han. arXiv preprint arXiv:2405.04532, 2024. Cited by 4.
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. J Tang, Y Zhao, K Zhu, G Xiao, B Kasikci, S Han. arXiv preprint arXiv:2406.10774, 2024. Cited by 1.