Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters Q Hu, P Sun, S Yan, Y Wen, T Zhang Proceedings of the International Conference for High Performance Computing …, 2021 | 98 | 2021 |
Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo, T Zhang, Y Wen arXiv preprint arXiv:2205.11913, 2022 | 24 | 2022 |
Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs Q Hu, M Zhang, P Sun, Y Wen, T Zhang Proceedings of the 28th ACM International Conference on Architectural …, 2023 | 12 | 2023 |
Characterization of large language model development in the datacenter Q Hu, Z Ye, Z Wang, G Wang, M Zhang, Q Chen, P Sun, D Lin, X Wang, ... 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI …, 2024 | 11 | 2024 |
Boosting distributed full-graph GNN training with asynchronous one-bit communication M Zhang, Q Hu, P Sun, Y Wen, T Zhang arXiv preprint arXiv:2303.01277, 2023 | 6 | 2023 |
Deep learning workload scheduling in GPU datacenters: A survey Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo, T Zhang, Y Wen ACM Computing Surveys 56 (6), 1-38, 2024 | 4 | 2024 |
Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters Q Hu, Z Ye, M Zhang, Q Chen, P Sun, Y Wen, T Zhang 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI …, 2023 | 4 | 2023 |
Primo: Practical Learning-Augmented Systems with Interpretable Models Q Hu, H Nori, P Sun, Y Wen, T Zhang 2022 USENIX Annual Technical Conference (USENIX ATC 22), 519-538, 2022 | 4 | 2022 |
InternEvo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding Q Chen, D Gu, G Wang, X Chen, YT Xiong, T Huang, Q Hu, X Jin, Y Wen, ... arXiv preprint arXiv:2401.09149, 2024 | 3 | 2024 |
FedDSE: Distribution-aware Sub-model Extraction for Federated Learning over Resource-constrained Devices H Wang, Y Jia, M Zhang, Q Hu, H Ren, P Sun, Y Wen, T Zhang Proceedings of the ACM on Web Conference 2024, 2902-2913, 2024 | | 2024 |
AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning Q Chen, Q Hu, Z Ye, G Wang, P Sun, Y Wen, T Zhang arXiv preprint arXiv:2311.00257, 2023 | | 2023 |
Building efficient and practical machine learning systems Q Hu Nanyang Technological University, 2023 | | 2023 |
Understanding the Workload Characteristics of Large Language Model Development Q Hu, P Sun, T Zhang | | |