FlowTracer: A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters

H Jamil, A Alim, L Schares, P Maniotis, L Schour… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing complexity of AI workloads, especially distributed Large Language Model
(LLM) training, places significant strain on the networking infrastructure of parallel data …

PARALEON: Automatic and Adaptive Tuning for DCQCN Parameters in RDMA Networks

Z Chen, M Zhang, G Li, J Cao, Y Jing, M Xu, R Xie… - zhangmenghao.github.io
RDMA is a kernel-bypass and transport-offload technology that provides high throughput
and low delay for datacenter networks, and DCQCN is the default and most widely used …

2FA Sketch: Two-Factor Armor Sketch for Accurate and Efficient Heavy Hitter Detection in Data Streams

X Liu, X Zhang, B Liu, T Li, T Yang, G Xie - yangtonghome.github.io
Detecting heavy hitters, which are flows exceeding a specified threshold, is crucial for
network measurement, but it faces challenges due to increasing throughput and memory …