You only cache once: Decoder-decoder architectures for language models Y Sun, L Dong, Y Zhu, S Huang, W Wang, S Ma, Q Zhang, J Wang, F Wei arXiv preprint arXiv:2405.05254, 2024 | 20 | 2024 |
Differential Transformer T Ye, L Dong, Y Xia, Y Sun, Y Zhu, G Huang, F Wei arXiv preprint arXiv:2410.05258, 2024 | 7 | 2024 |
nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training Z Lin, Y Miao, Q Zhang, F Yang, Y Zhu, C Li, S Maleki, X Cao, N Shang, ... 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI …, 2024 | 6 | 2024 |