Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory bound and cannot exploit arithmetic …
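To make the imbalance concrete, a rough machine-balance check illustrates the claim (the figures below are illustrative assumptions for a recent datacenter GPU such as an NVIDIA A100, not numbers taken from the entry above):

\[
B_{\mathrm{machine}} \approx \frac{19.5\ \mathrm{TFLOP/s\ (FP32\ peak)}}{1.55\ \mathrm{TB/s\ (HBM)}} \approx 12.6\ \mathrm{FLOP/byte},
\qquad
I_{\mathrm{SAXPY}} = \frac{2\ \mathrm{FLOP}}{12\ \mathrm{bytes}} \approx 0.17\ \mathrm{FLOP/byte}.
\]

A streaming kernel such as SAXPY (y = a*x + y) performs 2 FLOPs while moving 12 bytes per element, so its arithmetic intensity sits far below the machine balance and the kernel is limited by memory bandwidth rather than arithmetic throughput.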
S Pati, S Aga, N Jayasena… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Transfer learning in natural language processing (NLP) uses increasingly large models that tackle challenging problems. Consequently, these applications are driving the requirements …
We present automatic horizontal fusion, a novel optimization technique that complements the standard kernel fusion techniques for GPU programs. Unlike the standard fusion, whose …
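As a rough illustration of the idea (a minimal CUDA sketch of block-level horizontal fusion in general; the function names and block partitioning are assumptions, not the fusion scheme or code from the paper above), two independent kernels can be merged into one launch that routes different thread blocks to each original body, letting stalls in one workload be overlapped with progress in the other:

#include <cuda_runtime.h>

// Original, independent kernels rewritten as __device__ bodies so they can be fused.
// 'bid' stands in for blockIdx.x of the original, separate launch.
__device__ void scale_body(float* x, int n, float a, int bid) {
    int i = bid * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}

__device__ void add_body(const float* u, float* v, int n, int bid) {
    int i = bid * blockDim.x + threadIdx.x;
    if (i < n) v[i] = v[i] + u[i];
}

// Horizontally fused kernel: the grid is the concatenation of the two original
// grids, and each thread block is routed to one of the independent bodies.
__global__ void fused(float* x, int nx, float a,
                      const float* u, float* v, int nv,
                      int blocks_for_scale) {
    if (blockIdx.x < blocks_for_scale)
        scale_body(x, nx, a, blockIdx.x);
    else
        add_body(u, v, nv, blockIdx.x - blocks_for_scale);
}

// Launch (one kernel instead of two):
//   int tpb = 256;
//   int b1 = (nx + tpb - 1) / tpb, b2 = (nv + tpb - 1) / tpb;
//   fused<<<b1 + b2, tpb>>>(x, nx, a, u, v, nv, b1);

This contrasts with standard (producer-consumer) fusion, which merges dependent kernels to avoid round-tripping intermediate results through memory; horizontal fusion instead combines kernels with no data dependence.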
S Pati, S Aga, M Islam, N Jayasena… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has increased the reliance on efficient distributed training techniques …
Chiplets are transforming computer system designs, allowing system designers to combine heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic …
S Pati, S Aga, N Jayasena, MD Sinclair - arXiv preprint arXiv:2104.08335, 2021 - arxiv.org
Transfer learning in natural language processing (NLP), as realized using models like BERT (Bidirectional Encoder Representations from Transformers), has significantly improved …
P Gong, R Liu, Z Mao, Z Cai, X Yan, C Li… - Proceedings of the 29th …, 2023 - dl.acm.org
Graph sampling prepares training samples for graph learning and can dominate the training time. Due to the increasing algorithm diversity and complexity, existing sampling frameworks …
S Pati, S Aga, M Islam, N Jayasena… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has also increased the reliance on efficient distributed training techniques …
S Pati, S Aga, N Jayasena, MD Sinclair - arXiv preprint arXiv:2409.02227, 2024 - arxiv.org
Modern accelerators like GPUs are increasingly executing independent operations concurrently to improve the device's compute utilization. However, effectively harnessing this concurrency …
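To show what executing independent operations concurrently looks like at the API level (a minimal CUDA streams sketch under generic assumptions; it is not the mechanism proposed in the entry above), two kernels with no data dependence can be placed on separate streams so the hardware may overlap them:

#include <cuda_runtime.h>

__global__ void workA(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 2.0f * x[i];
}

__global__ void workB(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // workA touches only x and workB touches only y, so the two launches are
    // independent; putting them on different streams allows the GPU to run
    // them concurrently when neither kernel saturates the device on its own.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    int tpb = 256, blocks = (n + tpb - 1) / tpb;
    workA<<<blocks, tpb, 0, s1>>>(x, n);
    workB<<<blocks, tpb, 0, s2>>>(y, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}

Whether the kernels actually overlap depends on their per-kernel resource usage; a single large kernel can occupy every SM and effectively serialize the streams, which is part of why harnessing this concurrency well is nontrivial.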