Large-scale training is important to ensure high performance and accuracy of machine- learning models. At Facebook we use many different models, including computer vision …
Machine learning sits at the core of many essential products and services at Facebook. This paper describes the hardware and software infrastructure that supports machine learning at …
At Facebook, machine learning provides a wide range of capabilities that drive many aspects of user experience including ranking posts, content understanding, object detection …
The widespread application of deep learning has changed the landscape of computation in data centers. In particular, personalized recommendation for content ranking is now largely …
At-scale evaluation of new data center network innovations is becoming increasingly intractable. This is true for testbeds, where few, if any, can afford a dedicated, full-scale …
A Roy, H Zeng, J Bagga, G Porter… - Proceedings of the 2015 …, 2015 - dl.acm.org
Large cloud service providers have invested in increasingly larger datacenters to house the computing infrastructure required to support their services. Accordingly, researchers and …
J Gao, R Jamidar - Google White Paper, 2014 - research.google.com
The rapid adoption of Internetenabled devices, coupled with the shift from consumerside computing to SaaS and cloudbased systems, is accelerating the growth of largescale data …
A Sriraman, A Dhanotia - Proceedings of the Twenty-Fifth International …, 2020 - dl.acm.org
At global user population scale, important microservices in warehouse-scale data centers can grow to account for an enormous installed base of servers. With the end of Dennard …
In distributed DNN training, parameter servers (PS) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention …