Machine learning is widely used to solve networking challenges, ranging from traffic classification and anomaly detection to network configuration. However, machine learning …
Data center clusters that run DNN training jobs are inherently heterogeneous. They have GPUs and CPUs for computation and network bandwidth for distributed training. However …
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a …
Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols. This closed-design paradigm has limited the capability of the …
Distributed deep neural network training (DT) systems are widely deployed in clusters where the network is shared across multiple tenants, ie, multiple DT jobs. Each DT job computes …
Z Xiong, N Zilberman - Proceedings of the 18th ACM workshop on hot …, 2019 - dl.acm.org
Machine learning is currently driving a technological and societal revolution. While programmable switches have been proven to be useful for in-network computing, machine …
Memory disaggregation promises transparent elasticity, high resource utilization and hardware heterogeneity in data centers by physically separating memory and compute into …
Programmable data plane technologies enable the systematic reconfiguration of the low- level processing steps applied to network packets and are key drivers toward realizing the …