Congestion control for cross-datacenter networks

G Zeng, W Bai, G Chen, K Chen, D Han… - IEEE/ACM …, 2022 - ieeexplore.ieee.org
Geographically distributed applications hosted on cloud are becoming prevalent. They run
on cross-datacenter network that consists of multiple data center networks (DCNs) …

Flow optimization strategies in data center networks: A survey

Y Liu, T Yu, Q Meng, Q Liu - Journal of Network and Computer Applications, 2024 - Elsevier
In the era of digitization, Data Center Networks (DCN) have emerged as a critical component
supporting infrastructure for cloud computing, big data analytics, online services, and more …

Masking corruption packet losses in datacenter networks with link-local retransmission

R Joshi, CH Song, XZ Khooi, N Budhdev… - Proceedings of the …, 2023 - dl.acm.org
Packet loss due to link corruption is a major problem in large warehouse-scale datacenters.
The current state-of-the-art approach of disabling corrupting links is not adequate because …

RateMP: Optimizing Bandwidth Utilization with High Burst Tolerance in Data Center Networks

J Han, K Xue, W Wang, R Li, Q Sun… - IEEE INFOCOM 2024 …, 2024 - ieeexplore.ieee.org
Load balancing in data center networks (DCNs) is a crucial and complex undertaking.
Multi-path TCP (MPTCP) has been proposed as a cost-effective solution that aims to distribute …

L2BM: Switch buffer management for hybrid traffic in data center networks

Y Liu, J Han, K Xue, R Li, J Li - 2023 IEEE 43rd International …, 2023 - ieeexplore.ieee.org
With Remote Direct Memory Access (RDMA) extended to commercial Ethernet, modern Data
Center Networks (DCNs) carry both traditional TCP and RDMA, to support diversified …

R-PFC: Enhancing RDMA Network With Restricted And Fine-grained PFC

X Li, M Li, X Ai, Y Gao, J Shao, Z Chen… - 2024 IEEE/ACM 32nd …, 2024 - ieeexplore.ieee.org
RDMA over Converged Ethernet (RoCE) has been widely used in datacenter networks and
it relies on Priority Flow Control (PFC) to ensure a lossless network. However, PFC brings …

DDT: Dynamical Selective Dropping Threshold for Reactive Congestion Control

H Zhou, D Hu, Z Zhou, G Yuan, D Dong - Proceedings of the ACM Turing …, 2024 - dl.acm.org
Traditional congestion control algorithms (CCAs) frequently struggle to manage microbursts,
resulting in performance degradation. Although RoCEv2 (RDMA over Converged Ethernet …

Cloud Burst Prediction System using Machine Learning

G Nagappan - … Conference (OTCON) on Smart Computing for …, 2024 - ieeexplore.ieee.org
This abstract presents an innovative approach that leverages the Gramian Angular Field
(GAF) in conjunction with Convolutional Neural Networks (CNN) to improve the accuracy …

Towards Efficient and Effective Distributed Training and Inference System for Large-Scale Machine Learning

W Wang - 2023 - search.proquest.com
… effectively extract knowledge from large-scale training data. Such large models
have achieved state-of-art accuracy results in various tasks, including but not limited to …

DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training

W Wang, C Zhang, L Yang, K Chen, K Tan - arXiv preprint arXiv …, 2020 - arxiv.org
Bulk synchronous parallel (BSP) is the de-facto paradigm for distributed DNN training in
today's production clusters. However, due to the global synchronization nature, its …