AuTO: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization

L Chen, J Lingys, K Chen, F Liu - Proceedings of the 2018 conference of …, 2018 - dl.acm.org
Traffic optimizations (TO, e.g., flow scheduling, load balancing) in datacenters are difficult
online decision-making problems. Previously, they are done with heuristics relying on …

Language-directed hardware design for network performance monitoring

S Narayana, A Sivaraman, V Nathan, P Goyal… - Proceedings of the …, 2017 - dl.acm.org
Network performance monitoring today is restricted by existing switch support for
measurement, forcing operators to rely heavily on endpoints with poor visibility into the …

Flow event telemetry on programmable data plane

Y Zhou, C Sun, HH Liu, R Miao, S Bai, B Li… - Proceedings of the …, 2020 - dl.acm.org
Network performance anomalies (NPAs), e.g., long-tailed latency, bandwidth decline, etc., are
increasingly crucial to cloud providers as applications are getting more sensitive to …

A survey on big data for network traffic monitoring and analysis

A D'Alconzo, I Drago, A Morichetta… - … on Network and …, 2019 - ieeexplore.ieee.org
Network Traffic Monitoring and Analysis (NTMA) represents a key component for network
management, especially to guarantee the correct operation of large-scale networks such as …

Next-generation data center network enabled by machine learning: Review, challenges, and opportunities

H Dong, A Munir, H Tout, Y Ganjali - IEEE Access, 2021 - ieeexplore.ieee.org
Data center network (DCN) is the backbone of many emerging applications from smart
connected homes to smart traffic control and is continuously evolving to meet the diverse …

Diagnosing root causes of intermittent slow queries in cloud databases

M Ma, Z Yin, S Zhang, S Wang, C Zheng… - Proceedings of the …, 2020 - dl.acm.org
With the growing market of cloud databases, careful detection and elimination of slow
queries are of great importance to service stability. Previous studies focus on optimizing the …

From Luna to Solar: the evolutions of the compute-to-storage networks in Alibaba Cloud

R Miao, L Zhu, S Ma, K Qian, S Zhuang, B Li… - Proceedings of the …, 2022 - dl.acm.org
This paper presents the two generations of storage network stacks that reduced the average
I/O latency of Alibaba Cloud's EBS service by 72% in the last five years: Luna, a user-space …

NetBouncer: Active device and link failure localization in data center networks

C Tan, Z Jin, C Guo, T Zhang, H Wu, K Deng… - … USENIX Symposium on …, 2019 - usenix.org
The availability of data center services is jeopardized by various network incidents. One of
the biggest challenges for network incident handling is to accurately localize the failures …

007: Democratically finding the cause of packet drops

B Arzani, S Ciraci, L Chamon, Y Zhu, HH Liu… - … USENIX Symposium on …, 2018 - usenix.org
Network failures continue to plague datacenter operators as their symptoms may not have
direct correlation with where or why they occur. We introduce 007, a lightweight, always-on …

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …