Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving. Y Zhao, CY Lin, K Zhu, Z Ye, L Chen, S Zheng, L Ceze, A Krishnamurthy, et al. Proceedings of Machine Learning and Systems 6, 196-209, 2024. Cited by 83.
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. J Tang, Y Zhao, K Zhu, G Xiao, B Kasikci, S Han. arXiv preprint arXiv:2406.10774, 2024. Cited by 26.
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models. K Kamahori, Y Gu, K Zhu, B Kasikci. arXiv preprint arXiv:2402.07033, 2024. Cited by 10.
NanoFlow: Towards Optimal Large Language Model Serving Throughput. K Zhu, Y Zhao, L Zhao, G Zuo, Y Gu, D Xie, Y Gao, Q Xu, T Tang, Z Ye, et al. arXiv preprint arXiv:2408.12757, 2024. Cited by 9.
BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching. Y Zhao, S Yang, K Zhu, L Zheng, B Kasikci, Y Zhou, J Xing, I Stoica. arXiv preprint arXiv:2411.16102, 2024. Cited by 1.
Can Storage Devices be Power Adaptive? D Xie, T Stavrinos, K Zhu, S Peter, B Kasikci, T Anderson. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, 2024.